Dear Community
I’m looking for a program for the following.
What I have is a table HAVE. It has 3 fields: CD_YEAR, CD_MONTH, var_A.
This table contains 50 records:
The first 20 records are all CD_YEAR = 2018 and CD_MONTH = 1
The next 30 records are all CD_YEAR = 2018 and CD_MONTH = 2
What I need is a table WANT, where for var_A, a value is filled in following a statistical distribution.
I’m simplifying here: let’s say the distribution is:
50% => var_A = 10
25% => var_A = 15
25% => var_A = 20
So in table WANT, the needed result is:
First 10 records (cd_year= 2018, cd_month = 1) => var_A = 10
Next 5 records (cd_year= 2018, cd_month = 1) => var_A = 15
Next 5 records (cd_year= 2018, cd_month = 1) => var_A = 20
Next 15 records (cd_year= 2018, cd_month = 2) => var_A = 10
Next 7 or 8 records (cd_year= 2018, cd_month = 2) => var_A = 15
Next 7 or 8 records (cd_year= 2018, cd_month = 2) => var_A = 20
Notice the last 2 lines here above: as 25% of 30 records is not a whole number (7,5), it should take 7 or 8 records for var_A = 15. By consequence, it should take resp. 8 or 7 records for var_A = 20.
I hope I explained it clearly enough so that anyone could understand what I want.
I can’t see a method how to implement this.
Any hints or directions how to deal with this would be very appreciated.
(the real distribution is a little more complicated, with 13 percentages - I used the above example just to illustrate)
Thanks
Hello @fre,
This creates your dataset HAVE (without var_A, for which you haven't provided values).
data have;
cd_year=2018; /* In "example wanted.xlsx" it's 2016. */
do _n_=1 to 50;
cd_month=1+(_n_>20);
output;
end;
run;
I'd recommend that you store your "statistical distribution" in a dataset like this:
data dist;
input prop val;
cards;
0.5 10
0.25 15
0.25 20
;
Then you can create dataset WANT including variable var_A (or value_A, as it is called in "example wanted.xlsx") as follows:
/* Create format for cumulative proportions */
data cpfmt(drop=prop);
retain fmtname 'cumprop' sexcl 'N';
set dist(rename=(val=label));
end+prop;
output;
sexcl='Y';
start+prop;
run;
proc format cntlin=cpfmt;
run;
/* Add variable var_A */
data want(drop=_:);
do until(last.cd_month);
set have;
by cd_year cd_month;
_denom=sum(_denom,1);
end;
do until(last.cd_month);
set have;
by cd_year cd_month;
_num=sum(_num,1);
var_A=input(put(_num/_denom,cumprop.),8.);
output;
end;
run;
Edit 1: minor simplification in definition of var_A.
Edit 2: Changed BY statements to
by cd_year cd_month;
in order to allow HAVE datasets containing more than one value of cd_year (sorted by cd_year cd_month).
Hello @fre,
This creates your dataset HAVE (without var_A, for which you haven't provided values).
data have;
cd_year=2018; /* In "example wanted.xlsx" it's 2016. */
do _n_=1 to 50;
cd_month=1+(_n_>20);
output;
end;
run;
I'd recommend that you store your "statistical distribution" in a dataset like this:
data dist;
input prop val;
cards;
0.5 10
0.25 15
0.25 20
;
Then you can create dataset WANT including variable var_A (or value_A, as it is called in "example wanted.xlsx") as follows:
/* Create format for cumulative proportions */
data cpfmt(drop=prop);
retain fmtname 'cumprop' sexcl 'N';
set dist(rename=(val=label));
end+prop;
output;
sexcl='Y';
start+prop;
run;
proc format cntlin=cpfmt;
run;
/* Add variable var_A */
data want(drop=_:);
do until(last.cd_month);
set have;
by cd_year cd_month;
_denom=sum(_denom,1);
end;
do until(last.cd_month);
set have;
by cd_year cd_month;
_num=sum(_num,1);
var_A=input(put(_num/_denom,cumprop.),8.);
output;
end;
run;
Edit 1: minor simplification in definition of var_A.
Edit 2: Changed BY statements to
by cd_year cd_month;
in order to allow HAVE datasets containing more than one value of cd_year (sorted by cd_year cd_month).
Hi FreelanceReinhard,
thank you a million times for your code. It's far above my skilsl, but I think I can adapt it for my use.
In the example, year and month are already sorted (ascending) in the first place.
If I would start from an unsorted dataset, should I first sort it (proc sort), or does this code works without sorting?
EDIT: I've managed to apply your code to my original dataset. Such a beautiful and efficient code. I've learned a lot today. I'm very thankful to this Community such free support exists, and of course in particular to you, Reinhard!
Hi @fre, you're welcome. Yes, the HAVE dataset must be sorted (or indexed) by cd_year cd_month, because the final DATA step uses SET in conjunction with BY statements. I have extended the BY statements (see edited code) in order to allow HAVE datasets containing more than one value of cd_year , for example:
data have;
do cd_year=2016 to 2018;
do _n_=1 to 50;
cd_month=1+(_n_>20);
output;
end;
end;
run;
So, if HAVE is not yet sorted, use PROC SORT first:
proc sort data=have;
by cd_year cd_month;
run;
@fre wrote:
It's far above my skill, but I think I can adapt it for my use.
This sounds a bit risky. Please don't hesitate to ask if you have further questions. I'll be happy to check your adaptations if you like.
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.