Hi, I have a dataset contain the age group and income info. I want to create a dataset from it, counting the cumulative income level by age group. For example, for each age group, I want to know how many people have income <= 1000, <= 2000, ... etc.
I tried using proc sql - group by function (on age group) along with sum - case when on income, but I got all the income level as column instead of rows. Is there an efficient way to do this?
HAVE:
ID | Age_Group | Income |
1 | 21-30 | 1000 |
2 | 51-60 | 1200 |
3 | 31-40 | 1400 |
4 | 51-60 | 2020 |
5 | 21-30 | 3000 |
6 | 51-60 | 3000 |
WANT:
Age_Group | Income | Cumulative_count |
21-30 | 1000 | 1 |
21-30 | 2000 | 1 |
21-30 | 3000 | 2 |
31-40 | 1000 | 0 |
31-40 | 2000 | 1 |
31-40 | 3000 | 1 |
51-60 | 1000 | 0 |
51-60 | 2000 | 2 |
51-60 | 3000 | 3 |
Assuming you know the start/end of the ranges of income you'd like to represent in your data, I would recommend something like this:
data income_fmt;
fmtname='income_fmt';
type='N';
do i=1000 to 10000 by 1000;
start=i;
end=i+1000 - 1;
label=i;
output;
end;
run;
proc format cntlin=income_fmt;
run;
data have;
infile cards dlm='09'x;
input ID $ Age_Group $ Income;
cards;
1 21-30 1000
2 51-60 1200
3 31-40 1400
4 51-60 2020
5 21-30 3000
6 51-60 3000
6 51-60 5000
;
;
;;
run;
proc sort data=have;
by age_group;
run;
proc freq data=have noprint;
by age_group;
table income/ list outcum sparse out=want;
format income income_fmt.;
run;
proc print data=want;
run;
CAVEAT: If you're missing a range, ie 3000 across all categories it will not print in the results
There are some errors in your want output. Anyway to get the correct cumulative counts, use a multilabel format with proc summary.
data have;
infile cards dlm='09'x;
input ID Age_Group $ Income;
cards;
1 21-30 1000
2 51-60 1200
3 31-40 1400
4 51-60 2020
5 21-30 3000
6 51-60 3000
;
run;
proc format;
value inclvl (multilabel)
low - 1000 = '<= $1000'
low - 2000 = '<= $2000'
low - 3000 = '<= $3000'
;
run;
proc summary data=have nway completetypes;
class age_group;
class income / mlf order=formatted;
format income inclvl.;
var id;
output out=want (drop=_:) n=cum_count;
run;
data have;
infile cards expandtabs;
input ID Age_Group $ Income;
cards;
1 21-30 1000
2 51-60 1200
3 31-40 1400
4 51-60 2020
5 21-30 3000
6 51-60 3000
;
run;
data level;
do income=1000 to 3000 by 1000;
output;
end;
run;
proc sql;
create table want as
select *,(select count(*) from have where Age_Group=a.Age_Group and income<=a.income) as Cum_count
from (select * from (select distinct Age_Group from have),(select income from level)) as a
order by 1,2;
quit;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.