Solved: how to assign values based on condition

fre · Posted 11-13-2018 05:45 AM

Dear Community

I’m looking for a program for the following.

What I have is a table HAVE. It has 3 fields: CD_YEAR, CD_MONTH, var_A.

This table contains 50 records:

The first 20 records are all CD_YEAR = 2018 and CD_MONTH = 1

The next 30 records are all CD_YEAR = 2018 and CD_MONTH = 2

What I need is a table WANT, where for var_A, a value is filled in following a statistical distribution.

I’m simplifying here: let’s say the distribution is:

50% => var_A = 10

25% => var_A = 15

25% => var_A = 20

So in table WANT, the needed result is:

First 10 records (cd_year= 2018, cd_month = 1) => var_A = 10

Next 5 records (cd_year= 2018, cd_month = 1) => var_A = 15

Next 5 records (cd_year= 2018, cd_month = 1) => var_A = 20

Next 15 records (cd_year= 2018, cd_month = 2) => var_A = 10

Next 7 or 8 records (cd_year= 2018, cd_month = 2) => var_A = 15

Next 7 or 8 records (cd_year= 2018, cd_month = 2) => var_A = 20

Notice the last 2 lines here above: as 25% of 30 records is not a whole number (7,5), it should take 7 or 8 records for var_A = 15. By consequence, it should take resp. 8 or 7 records for var_A = 20.

I hope I explained it clearly enough so that anyone could understand what I want.

I can’t see a method how to implement this.

Any hints or directions how to deal with this would be very appreciated.

(the real distribution is a little more complicated, with 13 percentages - I used the above example just to illustrate)

Thanks

FreelanceReinh · Posted 11-13-2018 06:48 AM

Hello @fre,

This creates your dataset HAVE (without var_A, for which you haven't provided values).

data have;
cd_year=2018; /* In "example wanted.xlsx" it's 2016. */
do _n_=1 to 50;
  cd_month=1+(_n_>20);
  output;
end;
run;

I'd recommend that you store your "statistical distribution" in a dataset like this:

data dist;
input prop val;
cards;
0.5  10
0.25 15
0.25 20
;

Then you can create dataset WANT including variable var_A (or value_A, as it is called in "example wanted.xlsx") as follows:

/* Create format for cumulative proportions */

data cpfmt(drop=prop);
retain fmtname 'cumprop' sexcl 'N';
set dist(rename=(val=label));
end+prop;
output;
sexcl='Y';
start+prop;
run;

proc format cntlin=cpfmt;
run;

/* Add variable var_A */

data want(drop=_:);
do until(last.cd_month);
  set have;
  by cd_year cd_month;
  _denom=sum(_denom,1);
end;
do until(last.cd_month);
  set have;
  by cd_year cd_month;
  _num=sum(_num,1);
  var_A=input(put(_num/_denom,cumprop.),8.);
  output;
end;
run;

Edit 1: minor simplification in definition of var_A.

Edit 2: Changed BY statements to

by cd_year cd_month;

in order to allow HAVE datasets containing more than one value of cd_year (sorted by cd_year cd_month).

View solution in original post

FreelanceReinh · Posted 11-13-2018 06:48 AM

Hello @fre,

This creates your dataset HAVE (without var_A, for which you haven't provided values).

data have;
cd_year=2018; /* In "example wanted.xlsx" it's 2016. */
do _n_=1 to 50;
  cd_month=1+(_n_>20);
  output;
end;
run;

I'd recommend that you store your "statistical distribution" in a dataset like this:

data dist;
input prop val;
cards;
0.5  10
0.25 15
0.25 20
;

Then you can create dataset WANT including variable var_A (or value_A, as it is called in "example wanted.xlsx") as follows:

/* Create format for cumulative proportions */

data cpfmt(drop=prop);
retain fmtname 'cumprop' sexcl 'N';
set dist(rename=(val=label));
end+prop;
output;
sexcl='Y';
start+prop;
run;

proc format cntlin=cpfmt;
run;

/* Add variable var_A */

data want(drop=_:);
do until(last.cd_month);
  set have;
  by cd_year cd_month;
  _denom=sum(_denom,1);
end;
do until(last.cd_month);
  set have;
  by cd_year cd_month;
  _num=sum(_num,1);
  var_A=input(put(_num/_denom,cumprop.),8.);
  output;
end;
run;

Edit 1: minor simplification in definition of var_A.

Edit 2: Changed BY statements to

by cd_year cd_month;

in order to allow HAVE datasets containing more than one value of cd_year (sorted by cd_year cd_month).

fre · Posted 11-13-2018 07:46 AM

Hi FreelanceReinhard,

thank you a million times for your code. It's far above my skilsl, but I think I can adapt it for my use.

In the example, year and month are already sorted (ascending) in the first place.

If I would start from an unsorted dataset, should I first sort it (proc sort), or does this code works without sorting?

EDIT: I've managed to apply your code to my original dataset. Such a beautiful and efficient code. I've learned a lot today. I'm very thankful to this Community such free support exists, and of course in particular to you, Reinhard!

FreelanceReinh · Posted 11-13-2018 09:06 AM

Hi @fre, you're welcome. Yes, the HAVE dataset must be sorted (or indexed) by cd_year cd_month, because the final DATA step uses SET in conjunction with BY statements. I have extended the BY statements (see edited code) in order to allow HAVE datasets containing more than one value of cd_year , for example:

data have;
do cd_year=2016 to 2018;
  do _n_=1 to 50;
    cd_month=1+(_n_>20);
    output;
  end;
end;
run;

So, if HAVE is not yet sorted, use PROC SORT first:

proc sort data=have;
by cd_year cd_month;
run;

@fre wrote:

It's far above my skill, but I think I can adapt it for my use.

This sounds a bit risky. Please don't hesitate to ask if you have further questions. I'll be happy to check your adaptations if you like.

how to assign values based on condition

Re: how to assign values based on condition

Re: how to assign values based on condition

Re: how to assign values based on condition

Re: how to assign values based on condition

how to assign values based on condition

Re: how to assign values based on condition

Re: how to assign values based on condition

Re: how to assign values based on condition

Re: how to assign values based on condition

Ready to join fellow brilliant minds for the SAS Hackathon?