The general question is if it is possible to create a function that takes a list, applies a function to each item in that list, and then returns the results in a usable type (list, dataset, etc).
In my specific case, I have a dataset with several categorical variables, and I want to return a dataset that contains various summary statistics (mean, count, etc) as columns and rows are determined by each categorical variable. I can do with SQL by creating a dataset of a single row for each categorical variable and then iteratively combining these, but it is rather ugly and awkward. Is there a better way to do this?
For categorical variables you would use PROC FREQ to get counts
For numerical/continous variables you would use PROC MEANS. You've received serveral results for PROC MEANS, so I'll show you the PROC FREQ for categorical data.
And here's a way to list all the possible levels and counts:
https://gist.github.com/statgeek/e0903d269d4a71316a4e
Here's a quick way to do a table summary, with a single cross variable:
https://gist.github.com/statgeek/0c4aeec9053cf8050be18a03b842c1b9
Another good PROC to keep in your back pocket for summary tables is PROC TABULATE. It's pretty powerful for creating various summaries and probably under utilized.
@mark4 wrote:
The general question is if it is possible to create a function that takes a list, applies a function to each item in that list, and then returns the results in a usable type (list, dataset, etc).
In my specific case, I have a dataset with several categorical variables, and I want to return a dataset that contains various summary statistics (mean, count, etc) as columns and rows are determined by each categorical variable. I can do with SQL by creating a dataset of a single row for each categorical variable and then iteratively combining these, but it is rather ugly and awkward. Is there a better way to do this?
Please provide some example, with test data in the form of a datastep, and what the output should look like.
This: "The general question is if it is possible to create a function that takes a list, applies a function to each item in that list, and then returns the results in a usable type (list, dataset, etc)." - is what a datastep is.
Dataset in with list
Datastep runs function against that
Data out
I don't see how to reconcile your description of a data step with my problem. The data step, as far as I know, takes a list of datasets as input and produces a dataset, but that is not what I'm after here: My list is a set of categorical variables from a dataset and the function takes a variable and produces a dataset. So,
"Data Step": (dataset_1, ..., dataset_n) -> dataset,
but I want
"Wanted_Function": (variable, dataset) -> dataset
where variable is a column of the input dataset.
Since I have a working solution, I'm really interested in knowing if there are built-in functions or other solutions I might be missing that can clean up my code and the amount of datasets produced.
For example, if my dataset my_data looked like:
Sex Region Salary
M A 10000
M B 15000
F A 11000
F A 14000
I would want a dataset want_data that looked like
Category Count Avg_Salary
M 2 12500
F 2 12500
A 3 12500
B 1 11667
I can create this in SQL, but as I said, it's ugly - it builds a lot of intermediate datasets (here, just two - group_by_Sex, group_by_Region) that I don't want to keep and the union statement gets messy when you start incorporating more and more categorical variables.
%macro summary_table(categorical_variable);
create table group_by_&categorical_variable as select
&categorical_variable as category, count() as count, avg(Salarly) as Avg_Salary
from my_data group by &categorical_varialble;
%mend;
proc sql noprint;
%summary_table(Sex);
%summary_table(Region);
create table want_data as (select * from group_by_Sex) union (select * from group_by_Region);
quit;
data have; input Sex $ Region $ Salary; datalines; M A 10000 M B 15000 F A 11000 F A 14000 ; run; proc summary data=have; class sex region; var salary; output out=want sum=sum; run;
That gives the sums for sex/region and combinations. I can't find the post from data_null, but he had presented two options to reduce the number of the output, let me see if I can find it.
Ah, here we go:
@mark4 wrote:
I don't see how to reconcile your description of a data step with my problem.
The request for a data step that demonstrates the structure of the data set and names and types of variables is so we can provide examples that match you situation closer. A data step allows us to create a data set that code can be tested with.
It often saves a great many posts when a suggested solution is provided but then the original poster comes back and "says that didn't work", followed by questions determining why not. Then finding out that one of the variables was character and such concepts as "mean" and "standard deviation" don't actually apply. Also, a fair number of questions here data that is poorly structured for a specific task. With an example we can show how to restructure the data as well as get the solution steps.
Maybe something like this. I like the STACKODS output from PROC MEANS for summary stats.
data class;
   set sashelp.class;
   trt = rantbl(456789,.4);
   run;
proc print;
   run;
ods select none;
ods output summary=summary;
proc means data=class chartype descendtypes stackods missing n mean stddev median min max;
   class age sex trt / mlf;
   types trt*(age sex);
   run;
ods select all;
proc print;
   run;
For categorical variables you would use PROC FREQ to get counts
For numerical/continous variables you would use PROC MEANS. You've received serveral results for PROC MEANS, so I'll show you the PROC FREQ for categorical data.
And here's a way to list all the possible levels and counts:
https://gist.github.com/statgeek/e0903d269d4a71316a4e
Here's a quick way to do a table summary, with a single cross variable:
https://gist.github.com/statgeek/0c4aeec9053cf8050be18a03b842c1b9
Another good PROC to keep in your back pocket for summary tables is PROC TABULATE. It's pretty powerful for creating various summaries and probably under utilized.
@mark4 wrote:
The general question is if it is possible to create a function that takes a list, applies a function to each item in that list, and then returns the results in a usable type (list, dataset, etc).
In my specific case, I have a dataset with several categorical variables, and I want to return a dataset that contains various summary statistics (mean, count, etc) as columns and rows are determined by each categorical variable. I can do with SQL by creating a dataset of a single row for each categorical variable and then iteratively combining these, but it is rather ugly and awkward. Is there a better way to do this?
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
