Obsidian | Level 7

Function over a List - Create dataset with various groupping statistics

The general question is if it is possible to create a function that takes a list, applies a function to each item in that list, and then returns the results in a usable type (list, dataset, etc).

In my specific case, I have a dataset with several categorical variables, and I want to return a dataset that contains various summary statistics (mean, count, etc) as columns and rows are determined by each categorical variable.  I can do with SQL by creating a dataset of a single row for each categorical variable and then iteratively combining these, but it is rather ugly and awkward.  Is there a better way to do this?

1 ACCEPTED SOLUTION

Accepted Solutions
Super User

Re: Function over a List - Create dataset with various groupping statistics

For categorical variables you would use PROC FREQ to get counts

For numerical/continous variables you would use PROC MEANS. You've received serveral results for PROC MEANS, so I'll show you the PROC FREQ for categorical data.

And here's a way to list all the possible levels and counts:

https://gist.github.com/statgeek/e0903d269d4a71316a4e

Here's a quick way to do a table summary, with a single cross variable:

https://gist.github.com/statgeek/0c4aeec9053cf8050be18a03b842c1b9

Another good PROC to keep in your back pocket for summary tables is PROC TABULATE. It's pretty powerful for creating various summaries and probably under utilized.

@mark4 wrote:

The general question is if it is possible to create a function that takes a list, applies a function to each item in that list, and then returns the results in a usable type (list, dataset, etc).

In my specific case, I have a dataset with several categorical variables, and I want to return a dataset that contains various summary statistics (mean, count, etc) as columns and rows are determined by each categorical variable.  I can do with SQL by creating a dataset of a single row for each categorical variable and then iteratively combining these, but it is rather ugly and awkward.  Is there a better way to do this?

7 REPLIES 7
Diamond | Level 26

Re: Function over a List - Create dataset with various groupping statistics

Please provide some example, with test data in the form of a datastep, and what the output should look like.

This: "The general question is if it is possible to create a function that takes a list, applies a function to each item in that list, and then returns the results in a usable type (list, dataset, etc)." - is what a datastep is.

Dataset in with list

Datastep runs function against that

Data out

Obsidian | Level 7

Re: Function over a List - Create dataset with various groupping statistics

I don't see how to reconcile your description of a data step with my problem.  The data step, as far as I know, takes a list of datasets as input and produces a dataset, but that is not what I'm after here:  My list is a set of categorical variables from a dataset and the function takes a variable and produces a dataset.  So,

"Data Step": (dataset_1, ..., dataset_n) -> dataset,

but I want

"Wanted_Function": (variable, dataset) -> dataset

where variable is a column of the input dataset.

Since I have a working solution, I'm really interested in knowing if there are built-in functions or other solutions I might be missing that can clean up my code and the amount of datasets produced.

For example, if my dataset my_data looked like:

Sex    Region   Salary

M       A             10000

M       B             15000

F        A             11000

F        A             14000

I would want a dataset want_data that looked like

Category   Count   Avg_Salary

M               2          12500

F               2          12500

A               3          12500

B               1          11667

I can create this in SQL, but as I said, it's ugly - it builds a lot of intermediate datasets (here, just two - group_by_Sex, group_by_Region) that I don't want to keep and the union statement gets messy when you start incorporating more and more categorical variables.

%macro summary_table(categorical_variable);

create table group_by_&categorical_variable as select

&categorical_variable as category, count() as count, avg(Salarly) as Avg_Salary

from my_data group by  &categorical_varialble;

%mend;

proc sql noprint;

%summary_table(Sex);

%summary_table(Region);

create table want_data as (select * from group_by_Sex) union (select * from group_by_Region);

quit;

Diamond | Level 26

Re: Function over a List - Create dataset with various groupping statistics

data have;
input Sex \$ Region \$ Salary;
datalines;
M       A             10000
M       B             15000
F        A             11000
F        A             14000
;
run;

proc summary data=have;
class sex region;
var salary;
output out=want sum=sum;
run;

That gives the sums for sex/region and combinations.  I can't find the post from data_null, but he had presented two options to reduce the number of the output, let me see if I can find it.

Ah, here we go:

https://communities.sas.com/t5/SAS-Programming/Summary-table-with-categories-from-multiple-vars/m-p/...

Super User

Re: Function over a List - Create dataset with various groupping statistics

@mark4 wrote:

I don't see how to reconcile your description of a data step with my problem.

The request for a data step that demonstrates the structure of the data set and names and types of variables is so we can provide examples that match you situation closer. A data step allows us to create a data set that code can be tested with.

It often saves a great many posts when a suggested solution is provided but then the original poster comes back and "says that didn't work", followed by questions determining why not. Then finding out that one of the variables was character and such concepts as "mean" and "standard deviation" don't actually apply. Also, a fair number of questions here data that is poorly structured for a specific task. With an example we can show how to restructure the data as well as get the solution steps.

Obsidian | Level 7

Re: Function over a List - Create dataset with various groupping statistics

No, I get that. My intent wasn't so much a solution as general principals to get to a solution so I was trying to be as general as possible. Having the various options discussed below is very helpful for me since a lot of SAS requires you to know which procedures are even out there for you to use.

Re: Function over a List - Create dataset with various groupping statistics

Maybe something like this.  I like the STACKODS output from PROC MEANS for summary stats.

data class;
set sashelp.class;
trt = rantbl(456789,.4);
run;
proc print;
run;
ods select none;
ods output summary=summary;
proc means data=class chartype descendtypes stackods missing n mean stddev median min max;
class age sex trt / mlf;
types trt*(age sex);
run;
ods select all;
proc print;
run;

Super User

Re: Function over a List - Create dataset with various groupping statistics

For categorical variables you would use PROC FREQ to get counts

For numerical/continous variables you would use PROC MEANS. You've received serveral results for PROC MEANS, so I'll show you the PROC FREQ for categorical data.

And here's a way to list all the possible levels and counts:

https://gist.github.com/statgeek/e0903d269d4a71316a4e

Here's a quick way to do a table summary, with a single cross variable:

https://gist.github.com/statgeek/0c4aeec9053cf8050be18a03b842c1b9

Another good PROC to keep in your back pocket for summary tables is PROC TABULATE. It's pretty powerful for creating various summaries and probably under utilized.

@mark4 wrote:

The general question is if it is possible to create a function that takes a list, applies a function to each item in that list, and then returns the results in a usable type (list, dataset, etc).

In my specific case, I have a dataset with several categorical variables, and I want to return a dataset that contains various summary statistics (mean, count, etc) as columns and rows are determined by each categorical variable.  I can do with SQL by creating a dataset of a single row for each categorical variable and then iteratively combining these, but it is rather ugly and awkward.  Is there a better way to do this?

Discussion stats
• 7 replies
• 1645 views
• 3 likes
• 5 in conversation