DATA Step, Macro, Functions and more

summary data from data set

Accepted Solution Solved
Reply
Regular Contributor
Posts: 216
Accepted Solution

summary data from data set

Hi

I am looking for a way to summarize the data in a data set. Is there a way to create another data set that summarizes records in the data set from another? In other words, I have a data set that contains the months it takes to achieve certain events and these are stored as columns, along with a county field and a unique record indicator. I would like to group the times it takes to achieve these events by month groupings (i.e. 0-6, 7-12, 13-18, etc.) and then summarize these by sums, percentages by county.

Is there a way to do this?

Paul


Accepted Solutions
Solution
‎11-08-2012 10:57 AM
PROC Star
Posts: 7,363

Re: summary data from data set

Paul, As I had mentioned, earlier, you will get better responses if you provide an example dataset.

Without knowing what your data looks like, all any of us can do is guess.

Your output looks like a result that could easily be obtained using a combination of proc freq and proc transpose.  e.g., the following comes close:

proc format;

  value timespan

    low-0.084931=1

    0.0849310001-0.167123=2

    0.1671230001-0.252054=3

    other=4;

run;

data have;

  informat cnty_name $20.;

  informat description $30.;

  informat start_date date9.;

  format start_date date9.;

  input id cnty_name description age_range year start_date;

  time_in_program=yrdif(start_date, today(), 'AGE');

  cards;

1 Hopeland Outcome-2940 1 2006 8jan2012

2 Hopeland Outcome-2940 1 2006 8feb2012

3 Hopeland Outcome-2940 2 2006 8aug2012

4 Hopeland Outcome-2940 2 2006 8sep2012

5 Hopeland Outcome-2940 2 2006 8oct2012

1 Nohopeland Outcome-2940 1 2006 8jan2012

2 Nohopeland Outcome-2940 1 2006 8feb2012

3 Nohopeland Outcome-2940 3 2006 8aug2012

4 Nohopeland Outcome-2940 3 2006 8sep2012

5 Nohopeland Outcome-2940 3 2006 8oct2012

;

proc freq data=have noprint;

  tables cnty_name*age_range*time_in_program/out=test;

  format time_in_program timespan.;

run;

proc transpose data=test out=want (drop=_label_

            rename=(_name_=DataElement))prefix=months;

  by cnty_name age_range;

  idlabel time_in_program;

run;

View solution in original post


All Replies
PROC Star
Posts: 7,363

Re: summary data from data set

Paul,  sounds simple enough but it will help if you provide example have and want datasets, preferably in the form of datasteps.

Regular Contributor
Posts: 216

Re: summary data from data set

For this, I don't have much of a data step to do it actually. I am not even sure it can be done. The existing data table (s1) has the following columns (among many others, but the below are key):

entity_id (unique id)

age_category (1-4 indicator of the age range)

county (county)

ExitMonths (the time in months it takes to reach the court milestone)

ExitMonthCategory (either a 6, 12, 18, 24 indicator)

startYear (year child entered the court system)

What I would like to do is to export a new data set (ultimately to Excel) creating columns for month ranges (i.e. 0-6, 7-12, 13-18, etc.) and rows for summarized measures and dimensions (i.e. county, age range, number, percent, cumulative percent). The row dimensions would summarize the above ExitMonths column from the s1 data table.

I tried using Enterprise Guide to run a frequency over the s1 table, but could not get what I want.

Any ideas?

Paul

Super User
Posts: 10,500

Re: summary data from data set

What type of summary are you looking for, counts, percents, means or something else?

You may want to manually create and post an example of the final output desired appearance.

It may be that Proc Tabulate or Report would work as they allow many summarization options. The output can be set to Excel using ODS TAGSETS.EXCELXP and even result in semi-pretty tables.

Regular Contributor
Posts: 216

Re: summary data from data set

I am looking for number, cumulative number, percent, cumulative percent and total. I am attaching a link to an example of what the ultimate output from SAS would look like in Excel.

So I am I have a data table that has many child records per county and I need to summarize these records by county, and then by things like age range, year--all the dimensions that are in the link to the attached sample file.

I don't know how to summarize a data table though. I can do some things in the Enterprise Guide GUI with Procs, but I don't get everything and it is not as neatly organized as the linked example.

Any ideas on approach, etc. would be appreciated.

http://www.nycourts.gov/surveys/cwcip/metric1-Example.zip

Paul

Solution
‎11-08-2012 10:57 AM
PROC Star
Posts: 7,363

Re: summary data from data set

Paul, As I had mentioned, earlier, you will get better responses if you provide an example dataset.

Without knowing what your data looks like, all any of us can do is guess.

Your output looks like a result that could easily be obtained using a combination of proc freq and proc transpose.  e.g., the following comes close:

proc format;

  value timespan

    low-0.084931=1

    0.0849310001-0.167123=2

    0.1671230001-0.252054=3

    other=4;

run;

data have;

  informat cnty_name $20.;

  informat description $30.;

  informat start_date date9.;

  format start_date date9.;

  input id cnty_name description age_range year start_date;

  time_in_program=yrdif(start_date, today(), 'AGE');

  cards;

1 Hopeland Outcome-2940 1 2006 8jan2012

2 Hopeland Outcome-2940 1 2006 8feb2012

3 Hopeland Outcome-2940 2 2006 8aug2012

4 Hopeland Outcome-2940 2 2006 8sep2012

5 Hopeland Outcome-2940 2 2006 8oct2012

1 Nohopeland Outcome-2940 1 2006 8jan2012

2 Nohopeland Outcome-2940 1 2006 8feb2012

3 Nohopeland Outcome-2940 3 2006 8aug2012

4 Nohopeland Outcome-2940 3 2006 8sep2012

5 Nohopeland Outcome-2940 3 2006 8oct2012

;

proc freq data=have noprint;

  tables cnty_name*age_range*time_in_program/out=test;

  format time_in_program timespan.;

run;

proc transpose data=test out=want (drop=_label_

            rename=(_name_=DataElement))prefix=months;

  by cnty_name age_range;

  idlabel time_in_program;

run;

Regular Contributor
Posts: 216

Re: summary data from data set

Hi Arthur

Two questions among many, but two to start:

1. How are you coming up with the month ranges along the top of the spreadsheet? From the code, I am not seeing how this is determined.

2. The first proc format does what?

Paul

PROC Star
Posts: 7,363

Re: summary data from data set

Your second question answers your first question.  The proc format converts the age calculation into months.  I only did it for months 1, 2 and 3, and formatted all other "ages" to be 4, as I didn't know if this approach was what you were looking for or if your data was just dates like those I used as a sample dataset.

I calculated time in program with the statement: time_in_program=yrdif(start_date, today(), 'AGE');

Then, applied the format when I ran proc freq with the statement: format time_in_program timespan.;

Regular Contributor
Posts: 216

Re: summary data from data set

My data ranges (at top of the spreadsheet) are just individual months or 6 month groupings. How do they correspond to the below values (i.e. -0.084931)?

proc format;

  value timespan

    low-0.084931=1

    0.0849310001-0.167123=2

    0.1671230001-0.252054=3

    other=4;

run;

PROC Star
Posts: 7,363

Re: summary data from data set

Depends on your actual desired cutoffs.  I was approximating 1/12th of a year.  Actually, it should have been 0.08333 for one month, 0.166667 for two months, 0.25 for 3 months, etc.

Regular Contributor
Posts: 216

Re: summary data from data set

1. what does 'low' refer to in proc format?

2. How are you automatically classifying the values in time in program as belonging to one of the month categories?

Paul

PROC Star
Posts: 7,363

Re: summary data from data set

Paul, Low means the lowest value that SAS can represent (excluding missing values).

You didn't provide any example data, thus I built my example on the premise that you had a start date, which I called start_date.

I then calculated time_in_program as yrdif(start_date, today(), 'AGE');  What that calculated is the difference between start_date and today(), expressed as a real number, similar to how one would calculate someone's age.  If the difference was precisely one month, the resulting number would be 0.08333If the difference was precisely one year, the resulting number would be 1.0.


I created and applied a format to classify those calculations into time groups.


☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 11 replies
  • 365 views
  • 0 likes
  • 3 in conversation