- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello!
imagine a dataset and format like this:
data have;
input my_category $ value;
datalines;
A 1
A 0
A 3
B 7
B 3
B 1
;
run;
proc format ;
value $total_format
other = "T"
;
run;
I want to aggregate over my_category, using a custom format (i know, in this simple example I could simply remove my_category from CLASS, but I need to have the column my_category = T in the output dataset).
So I use proc summary:
proc summary nway data = have;
class my_category;
var value;
format my_category $total_format.;
output out = have_summed sum = ;
run;
this produces:
my_category _TYPE_ _FREQ_ value T 1 6 15
This is expected, so far so good. However, my_category retains the format $total_format, instead of... hardcoding it.
That means that if you remove the format with the following:
data have2;
set have_summed;
format my_category;
run;
the result is:
my_category _TYPE_ _FREQ_ value A 1 6 15
i.e. "A" instead of "T", which is simply wrong!
In have_summed, the column my_category appears to have the value "T".
If you export have_summed to a CSV, the formatted value "T" is retained. But if you use for example the CATS/CATX function with my_category, it does not use "T" but "A", i.e. the unformatted value. I find this very confusing. What am I missing? Is there a better way?
Platform SAS 9.4.7.0.
- Tags:
- format
- proc summary
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you want PROC SUMMARY to convert formatted values to character using value-labels you can add the multilabel-format option MLF to the class statement;
data have;
input my_category $ value;
datalines;
A 1
A 0
A 3
B 7
B 3
B 1
;
run;
proc format ;
value $total_format other="Total";
quit;
proc summary nway data = have;
class my_category / mlf;
var value;
format my_category $total_format.;
output out=have_summed sum=;
run;
proc contents varnum;
proc print;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It's not clear what you are trying to do in the last step
the result is:
my_category _TYPE_ _FREQ_ value A 1 6 15
i.e. "A" instead of "T", which is simply wrong!
but this is correct. Formatting doesn't change the underlying value. So when you remove the format, the underlying value is returned. Your earlier run of PROC SUMMARY puts all the value of my_category into one category which appears as "T" and the underlying value (unformatted) is the value of "A" (which is what PROC SUMMARY does when it combines categories, it uses the first value alphabetically or numerically to represent combined categories).
So what are you trying to do here?
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello @PaigeMiller, thanks for your answer.
I find it hard to believe that when aggregating over values, proc summary takes the first value as a label. I cannot think of a situation where that can be useful and not misleading... From your comment I understand that is the expected behaviour?
When you look at the last dataset, you see that the total for category "A" is 15. That is wrong.
I would not care as much and simply keep using the formatted value, except that in some procedures or functions SAS takes the formatted values, sometimes the unformatted value.
What I want is really to force the proc summary to use the format only once in the aggregation, not to assign it permanently to the variable while keeping the unformatted values.
Something like the put function in a data step:
new_var = put(other_var, $my_format.);
In this case new_var is not assigned permanently $my_format.
Maybe my whole approach is not SAS-y enough, maybe there is a better way. I just cant think of any simple way...
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@PavelD wrote:
Hello @PaigeMiller, thanks for your answer.
I find it hard to believe that when aggregating over values, proc summary takes the first value as a label. I cannot think of a situation where that can be useful and not misleading... From your comment I understand that is the expected behaviour?
What should PROC SUMMARY call a combined group? Well, anyway, yes, it picks the first value alphabetically or numerically in the combined group.
When you look at the last dataset, you see that the total for category "A" is 15. That is wrong.
Actually, your idea of assigning formats and removing them is causing SAS to say "A" has 15. SAS does what you tell it to do. So its not SAS that's wrong, it's your method that causes the wrong answer.
What I want is really to force the proc summary to use the format only once in the aggregation, not to assign it permanently to the variable while keeping the unformatted values.
So, when it does this to do the aggregation, what value should it assign to my_category? I'm really not following what you want here.
In addition, let's move away from the actual code for a second and explain the analysis you are trying to do from a higher level, with no discussion of code. You have data, you want a certain result, what is that desired result (without mentioning code)? Show us the desired result.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
In addition, let's move away from the actual code for a second and explain the analysis you are trying to do from a higher level, with no discussion of code. You have data, you want a certain result, what is that desired result (without mentioning code)? Show us the desired result.
Good point, I should have started with this in the first place. I appreciate your patience, @PaigeMiller and @Kurt_Bremser and others.
Maybe my choice of procs will turn out to be wrong completely.
So, I have raw data, let's say 10 columns, of which one is numeric (the VAR in proc summary). Of this raw data I need to create dozens of partially aggregated datasets, using predefined set of custom formats which I apply to 2-4 columns each time. I need to maintain the structure of the raw dataset, i.e. the 9+1 columns, so I put all 9 in the CLASS. My other idea was that I could have did some remerging after proc summary but I did not like that option, I always try to look for the simplest option. So now I have dozens of aggregated datasets, since they all have the same structure, I wanted to SET them together and create a single dataset, but that failed completely because each dimension adopted the zombie format from the first SET dataset, and as I wrote above the formats differed in each aggregation. Ultimately I wanted to export it into CSV, so I simply exported dozens of CSVs instead of one. This worked. What I meant was that in my situation, it would simply be easier to hardcode formats after each aggregation, perhaps with VVALUE as another member suggested, I thought about that, but I was just hoping there is a neater, cleaner option. If you have any idea or general remarks, I am all ears. Thanks!!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I'm sure that PROC SUMMARY could work here, but I'm still not clear what you want (for example, I don't understand what you mean by partially aggregated data sets) and again I request that you SHOW ME the desired output.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@PavelD wrote:
Thanks @PaigeMiller, I am sorry I cannot provide more details, nor the desired result, I think it would be a lots of tables and I dont think it would bring much value. Thanks for your help!!
You could show us the desired output for your example that your provided. But you haven't done that.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@PavelD I'm not able to follow your description of the ultimate goal. My fault not yours. If you would start a new tread using the have/need scenario I expect you to get many excellent answers that may or may not involve PROC SUMMARY.
@PavelD wrote:
In addition, let's move away from the actual code for a second and explain the analysis you are trying to do from a higher level, with no discussion of code. You have data, you want a certain result, what is that desired result (without mentioning code)? Show us the desired result.
Good point, I should have started with this in the first place. I appreciate your patience, @PaigeMiller and @Kurt_Bremser and others.
Maybe my choice of procs will turn out to be wrong completely.
So, I have raw data, let's say 10 columns, of which one is numeric (the VAR in proc summary). Of this raw data I need to create dozens of partially aggregated datasets, using predefined set of custom formats which I apply to 2-4 columns each time. I need to maintain the structure of the raw dataset, i.e. the 9+1 columns, so I put all 9 in the CLASS. My other idea was that I could have did some remerging after proc summary but I did not like that option, I always try to look for the simplest option. So now I have dozens of aggregated datasets, since they all have the same structure, I wanted to SET them together and create a single dataset, but that failed completely because each dimension adopted the zombie format from the first SET dataset, and as I wrote above the formats differed in each aggregation. Ultimately I wanted to export it into CSV, so I simply exported dozens of CSVs instead of one. This worked. What I meant was that in my situation, it would simply be easier to hardcode formats after each aggregation, perhaps with VVALUE as another member suggested, I thought about that, but I was just hoping there is a neater, cleaner option. If you have any idea or general remarks, I am all ears. Thanks!!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Great idea, @PavelD
Please focus on where you want to go with this analysis rather than on specific code. Examples with input data and output results are extremely helpful. I strongly suspect that by starting the discussion with your formatting problem, that we had entered what is called the XY problem.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
SAS retains the raw value because the variable will also retain its type, so if your category was numeric, but formatted to character values, the formatted values would not be possible:
data have;
input my_category value;
datalines;
1 1
1 0
1 3
2 7
2 3
2 1
;
proc format ;
value total_format
other = "T"
;
run;
proc means nway data=have sum;
class my_category;
var value;
format my_category total_format.;
output out=want sum=;
run;
my_category in WANT is numeric, so the format is needed, and the lowest numeric value for the group is kept.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
For reasons already stated, using a PUT with the format instead of removing the format is the way to go.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you want PROC SUMMARY to convert formatted values to character using value-labels you can add the multilabel-format option MLF to the class statement;
data have;
input my_category $ value;
datalines;
A 1
A 0
A 3
B 7
B 3
B 1
;
run;
proc format ;
value $total_format other="Total";
quit;
proc summary nway data = have;
class my_category / mlf;
var value;
format my_category $total_format.;
output out=have_summed sum=;
run;
proc contents varnum;
proc print;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content