BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
fre
Quartz | Level 8 fre
Quartz | Level 8

Dear Community

 

I’m looking for a program for the following.

 

What I have is a table HAVE.  It has 3 fields:  CD_YEAR, CD_MONTH, var_A.  

This table contains 50 records:

The first 20 records are all CD_YEAR = 2018 and CD_MONTH = 1

The next 30 records are all CD_YEAR = 2018 and CD_MONTH = 2

 

What I need is a table WANT, where for var_A,  a value is filled in following a statistical distribution.

I’m simplifying here:  let’s say the distribution is:

 

50% => var_A = 10

25% => var_A = 15

25% => var_A = 20

 

So in table WANT, the needed result is:

First 10 records (cd_year= 2018, cd_month = 1) => var_A = 10

Next 5 records (cd_year= 2018, cd_month = 1) => var_A = 15

Next 5 records (cd_year= 2018, cd_month = 1) => var_A = 20

 

Next 15 records (cd_year= 2018, cd_month = 2) => var_A = 10

Next 7 or 8 records (cd_year= 2018, cd_month = 2) => var_A = 15

Next 7 or 8 records (cd_year= 2018, cd_month = 2) => var_A = 20

 

Notice the last 2 lines here above:  as 25% of 30 records is not a whole number (7,5), it should take 7 or 8 records for var_A = 15.  By consequence, it should take resp. 8 or 7 records for var_A = 20.

 

I hope I explained it clearly enough so that anyone could understand what I want. 

I can’t see a method how to implement this. 

Any hints or directions how to deal with this would be very appreciated.

 

(the real distribution is a little more complicated, with 13 percentages - I used the above example just to illustrate)

 

Thanks

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello @fre,

 

This creates your dataset HAVE (without var_A, for which you haven't provided values).

data have;
cd_year=2018; /* In "example wanted.xlsx" it's 2016. */
do _n_=1 to 50;
  cd_month=1+(_n_>20);
  output;
end;
run;

I'd recommend that you store your "statistical distribution" in a dataset like this:

data dist;
input prop val;
cards;
0.5  10
0.25 15
0.25 20
;

Then you can create dataset WANT including variable var_A (or value_A, as it is called in "example wanted.xlsx") as follows:

/* Create format for cumulative proportions */

data cpfmt(drop=prop);
retain fmtname 'cumprop' sexcl 'N';
set dist(rename=(val=label));
end+prop;
output;
sexcl='Y';
start+prop;
run;

proc format cntlin=cpfmt;
run;

/* Add variable var_A */

data want(drop=_:);
do until(last.cd_month);
  set have;
  by cd_year cd_month;
  _denom=sum(_denom,1);
end;
do until(last.cd_month);
  set have;
  by cd_year cd_month;
  _num=sum(_num,1);
  var_A=input(put(_num/_denom,cumprop.),8.);
  output;
end;
run;

Edit 1: minor simplification in definition of var_A.

Edit 2: Changed BY statements to

by cd_year cd_month;

in order to allow HAVE datasets containing more than one value of cd_year (sorted by cd_year cd_month).

View solution in original post

3 REPLIES 3
FreelanceReinh
Jade | Level 19

Hello @fre,

 

This creates your dataset HAVE (without var_A, for which you haven't provided values).

data have;
cd_year=2018; /* In "example wanted.xlsx" it's 2016. */
do _n_=1 to 50;
  cd_month=1+(_n_>20);
  output;
end;
run;

I'd recommend that you store your "statistical distribution" in a dataset like this:

data dist;
input prop val;
cards;
0.5  10
0.25 15
0.25 20
;

Then you can create dataset WANT including variable var_A (or value_A, as it is called in "example wanted.xlsx") as follows:

/* Create format for cumulative proportions */

data cpfmt(drop=prop);
retain fmtname 'cumprop' sexcl 'N';
set dist(rename=(val=label));
end+prop;
output;
sexcl='Y';
start+prop;
run;

proc format cntlin=cpfmt;
run;

/* Add variable var_A */

data want(drop=_:);
do until(last.cd_month);
  set have;
  by cd_year cd_month;
  _denom=sum(_denom,1);
end;
do until(last.cd_month);
  set have;
  by cd_year cd_month;
  _num=sum(_num,1);
  var_A=input(put(_num/_denom,cumprop.),8.);
  output;
end;
run;

Edit 1: minor simplification in definition of var_A.

Edit 2: Changed BY statements to

by cd_year cd_month;

in order to allow HAVE datasets containing more than one value of cd_year (sorted by cd_year cd_month).

fre
Quartz | Level 8 fre
Quartz | Level 8

Hi FreelanceReinhard,

 

thank you a million times for your code.  It's far above my skilsl, but I think I can adapt it for my use.

 

In the example, year and month are already sorted (ascending) in the first place.

If I would start from an unsorted dataset, should I first sort it (proc sort), or does this code works without sorting?

 

 

EDIT: I've managed to apply your code to my original dataset.  Such a beautiful and efficient code.  I've learned a lot today.  I'm very thankful to this Community such free support exists, and of course in particular to you, Reinhard!

FreelanceReinh
Jade | Level 19

Hi @fre, you're welcome. Yes, the HAVE dataset must be sorted (or indexed) by cd_year cd_month, because the final DATA step uses SET in conjunction with BY statements. I have extended the BY statements (see edited code) in order to allow HAVE datasets containing more than one value of cd_year , for example:

data have;
do cd_year=2016 to 2018;
  do _n_=1 to 50;
    cd_month=1+(_n_>20);
    output;
  end;
end;
run;

So, if HAVE is not yet sorted, use PROC SORT first:

proc sort data=have;
by cd_year cd_month;
run;

@fre wrote:

It's far above my skill, but I think I can adapt it for my use.

 


This sounds a bit risky. Please don't hesitate to ask if you have further questions. I'll be happy to check your adaptations if you like.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 3 replies
  • 1007 views
  • 1 like
  • 2 in conversation