Hi, I have a dataset contain the age group and income info. I want to create a dataset from it, counting the cumulative income level by age group. For example, for each age group, I want to know how many people have income <= 1000, <= 2000, ... etc.
I tried using proc sql - group by function (on age group) along with sum - case when on income, but I got all the income level as column instead of rows. Is there an efficient way to do this?
HAVE:
ID | Age_Group | Income |
1 | 21-30 | 1000 |
2 | 51-60 | 1200 |
3 | 31-40 | 1400 |
4 | 51-60 | 2020 |
5 | 21-30 | 3000 |
6 | 51-60 | 3000 |
WANT:
Age_Group | Income | Cumulative_count |
21-30 | 1000 | 1 |
21-30 | 2000 | 1 |
21-30 | 3000 | 2 |
31-40 | 1000 | 0 |
31-40 | 2000 | 1 |
31-40 | 3000 | 1 |
51-60 | 1000 | 0 |
51-60 | 2000 | 2 |
51-60 | 3000 | 3 |
Assuming you know the start/end of the ranges of income you'd like to represent in your data, I would recommend something like this:
data income_fmt;
fmtname='income_fmt';
type='N';
do i=1000 to 10000 by 1000;
start=i;
end=i+1000 - 1;
label=i;
output;
end;
run;
proc format cntlin=income_fmt;
run;
data have;
infile cards dlm='09'x;
input ID $ Age_Group $ Income;
cards;
1 21-30 1000
2 51-60 1200
3 31-40 1400
4 51-60 2020
5 21-30 3000
6 51-60 3000
6 51-60 5000
;
;
;;
run;
proc sort data=have;
by age_group;
run;
proc freq data=have noprint;
by age_group;
table income/ list outcum sparse out=want;
format income income_fmt.;
run;
proc print data=want;
run;
CAVEAT: If you're missing a range, ie 3000 across all categories it will not print in the results
There are some errors in your want output. Anyway to get the correct cumulative counts, use a multilabel format with proc summary.
data have;
infile cards dlm='09'x;
input ID Age_Group $ Income;
cards;
1 21-30 1000
2 51-60 1200
3 31-40 1400
4 51-60 2020
5 21-30 3000
6 51-60 3000
;
run;
proc format;
value inclvl (multilabel)
low - 1000 = '<= $1000'
low - 2000 = '<= $2000'
low - 3000 = '<= $3000'
;
run;
proc summary data=have nway completetypes;
class age_group;
class income / mlf order=formatted;
format income inclvl.;
var id;
output out=want (drop=_:) n=cum_count;
run;
data have;
infile cards expandtabs;
input ID Age_Group $ Income;
cards;
1 21-30 1000
2 51-60 1200
3 31-40 1400
4 51-60 2020
5 21-30 3000
6 51-60 3000
;
run;
data level;
do income=1000 to 3000 by 1000;
output;
end;
run;
proc sql;
create table want as
select *,(select count(*) from have where Age_Group=a.Age_Group and income<=a.income) as Cum_count
from (select * from (select distinct Age_Group from have),(select income from level)) as a
order by 1,2;
quit;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.