Re: calculate multiple activities for some variables

Ronein · Posted 06-13-2024 03:19 PM

Hello

The raw data set have till 6 rows for each customer (Each row represent information of a month).

It might happen that customer have less than 6 rows.

From list of variables (let's say X,Y,Z) I need to calculate for each variable a new calculated variable (called calc_X,Calc_Y,Calc_Z) .

The logic how to calculate these 3 variables is as following:

1-If customer have less then 4 rows then each of the variables be null (calc_X,Calc_Y,Calc_Z)

2-If customer have at least 4 rows then need to calculate do this calculation for each of the variables (X,Y,Z). calculate CV(STD/AVG)

IF CV<=0.8 then the calculated value equal to average

else IF 0.8<CV<=1.5 then calculated value equal to the value of second observation after sort it from low to high

else IF CV>1.5 then calculated value equal to the value of First observation after sort it from low to high (note- sort by the variable that we want to calculate. For example: if we calculate calc_X then sort by X)

Here is example to raw data set.

Please not the request to calculated : Calc_X,Calc_Y,Calc_Z (and not calculate for R and T)

In real life I have 100 variables to calculate

Data have;
Input custID YYYYMM X Y Z R T;
cards;
1 202206 10 20 30 40 50
1 202207 15 20 25 30 35
1 202208 30 20 15 10 12
1 202209 15 18 21 27 31 
1 202210 19 22 40 15 12
1 202211 40 20 30 10 25
2 202210 25 45 35 15 10
2 202211 20 16 18 24 31
;
Run;

What id the recommended way to do it please?

ballardw · Posted 06-13-2024 04:40 PM

else IF 0.8<CV<=1.5 then calculated value equal to the value of second observation after sort it from low to high
else IF CV>1.5 then calculated value equal to the value of First observation after sort it from low to high

Sort by which variable? The value of what?

Note that your data step does not have a Datalines/Cards statement and hence does not run.

Ronein · Posted 06-13-2024 04:50 PM

sort by the variable that we want to calculate. For example: if we calculate calc_X then sort by X

ballardw · Posted 06-13-2024 05:42 PM

Some of the below code is so we don't have to fool around with "100's of variable" names.

I changed some values in the data step so there were some examples of the different ranges of CV to test.

Note that reshaping the data allows sorting into the order and selecting the value you want after adding an ordering variable.

Data have;
Input custID YYYYMM X Y Z R T;
datalines;
1 202206 100 20 900 40 50
1 202207 15 20 25 30 35
1 202208 30 20 15 10 12
1 202209 15 18 21 27 31 
1 202210 19 22 40 15 12
1 202211 40 20 30 10 25
2 202210 25 45 35 15 10
2 202211 20 16 18 24 31
;
Run;

proc transpose data=have out=trans (rename=(_name_=Var))
   ;
   by custid;
   var x y z;
run;

proc sort data=trans;
   by custid Var;
run;

proc transpose data=trans out=trans2;
  by custid Var;
  var col: ;
run;

proc sort data=trans2;
  by custid Var Col1;
run;
  
proc summary data=trans2 nway;
   class custid var;
   var col1;
   output out=summary (drop=_type_ _freq_) cv=CV mean= mean n=n std=std/autoname;
run;
/* the above cv is the percent, ie std/mean multiplied by 100  */
data need;
   set summary;
   cv=cv/100;
   If n ge 4 then do;
      if cv le 0.8 then calc=mean;
      else IF 0.8< CV le 1.5 then use=2;
      else IF CV>1.5 then use=1;
   end;
   else calc =.;
run;

data order;
   set trans2 (drop=_name_);
   by custid var;
   retain use;
   if first.var then use=1;
   else use+1;
run;

data need2;
   merge order 
         need (in=inneed)
   ;
   by custid var use;
   if inneed;
   if use then calc=col1;
run;

proc transpose data=need2 out=want (drop=_name_)
     prefix=calc_
;
   by custid;
   id var;
   var calc;
run;

If you have hundreds of thousands of observations the transpose steps may take a long time and hopefully won't run out of memory.

Ronein · Posted 06-13-2024 05:53 PM

In real there are 12 million rows with 50 columns to calculate

Ronein · Posted 06-14-2024 05:46 AM

Where did you apply the logic of second smallest value or smallest value ?

mkeintz · Posted 06-13-2024 08:34 PM

You can effectively reshape the data by using a hash object, reducing code to a single DATA step. The hash object H below has only 3 dataitems (with variables _VALUES1 through _VALUES63). One dataitem (think one row) for X. One for Y and one for Z.

Using dataset HAVE as provided by @ballardw, you can do the below.\:


data have;
Input custID YYYYMM X Y Z R T;
datalines;
1 202206 100 20 900 40 50
1 202207 15 20 25 30 35
1 202208 30 20 15 10 12
1 202209 15 18 21 27 31 
1 202210 19 22 40 15 12
1 202211 40 20 30 10 25
2 202210 25 45 35 15 10
2 202211 20 16 18 24 31
run;
data want (drop=_: x y z);
  set have (keep=custid x y z);
  by custid;

  array _values{6} ;
  length _key 8;
  if _n_=1 then do;
    declare hash h ();
      h.definekey('_key');
      h.definedata('_values1','_values2','_values3','_values4','_values5','_values6');
      h.definedone();
  end;

  if first.custid then call missing(_n,of _:);
  _n+1;

  array xyz  {3} x y z;
  do _key=1 to 3;
    _rc=h.find();
    _values{_n}=xyz{_key};
    h.replace();
  end;

  if last.custid;

  array  calc{*} x_calc y_calc z_calc ;

  if _n<4 then call missing(of calc{*});
  else do _key=1 to 3;
    h.find();
    _cv=std(of _values{*})/mean(of _values{*});
    if _cv<=0.8 then calc{_key}=mean(of _values{*}); else
    if _cv<=1.5 then calc{_key}=smallest(2,of _values{*});  else
    calc{_key}=min(of _values{*});
  end;
  h.clear();
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ronein · Posted 06-13-2024 08:44 PM

Please note that if the 6 values are for example : 20,10,20,20,20,10
Then after order it from low to high
10,10,20,20,20,20
Value of 1st is 10
Value of 2nd is 10
Mean is 100/6
Will smallest 2 function that used give 10 or 20?

ballardw · Posted 06-13-2024 11:57 PM

@Ronein wrote:
Please note that if the 6 values are for example : 20,10,20,20,20,10
Then after order it from low to high
10,10,20,20,20,20
Value of 1st is 10
Value of 2nd is 10
Mean is 100/6
Will smallest 2 function that used give 10 or 20?

Why not write some code and test it yourself if you have questions:

data example;
   result = smallest(2,10,10,20,20,20,20);
   /* and without ordering the values*/
   result2 = smallest(2,20,10,20,20,10,20);
run;

whymath · Posted 06-14-2024 02:14 AM

Based on anwser of @mkeintz , here is an DOW-Loop + Array version:

data want;
  do i=1 by 1 until(last.custid);
    set have;
    by custid;
    array _raw_[*] x y z;
    array _val_[3,99]_temporary_;
    array _cal_[*] calc_x calc_y calc_z;
    do j=1 to dim(_raw_);
      _val_[j,i]=_raw_[j];
    end;
  end;

  array _tmp_[99]_temporary_;
  if i>4 then do j=1 to dim(_raw_);
    do k=1 to i;
      _tmp_[k]=_val_[j,k];
    end;
    cv=cv(of _tmp_[*])/100;     *The function CV() multipy the result 100 times by default;

    if cv>1.5 then _cal_[j]=min(of _tmp_[*]);
    else if cv>0.8 then _cal_[j]=smallest(2,of _tmp_[*]);
    else if cv>. then _cal_[j]=mean(of _tmp_[*]);
  end;
  drop i j k cv;
run;

Array _val_ has two dimension: the first for number of variables, the second for oberservations of different custid , you can use bigger numbers to process more variables and observations, with not signicicant increasment on time consuming.

Ronein · Posted 06-14-2024 04:09 AM

If I have 20 variables and 2 million customers ( let's say that each customer has max 6 rows but can have less ). What will be written in _val_[3,99]?
I understand that instead of 3 need to write 20.but what about 99? What should write there ?

whymath · Posted 06-16-2024 10:15 PM

It can be _val_[20,6]. This array is designed to collect all values of specified variables, so the second dimension could have a big number like 99. But if you can sure that each customer has max 6 rows, 6 is enough to be the second dimension.

Ronein · Posted 06-14-2024 05:39 AM

Thanks!

What happens if mean=0 ? then CV calculation will get error

PaigeMiller · Posted 06-14-2024 06:07 AM

@Ronein wrote:

What happens if mean=0 ? then CV calculation will get error

It does not generate an error, it generates a NOTE and the rest of the data set continues to run. You should try it and find out what happens when mean=0.

If what does happen isn't what you want, what do you want to happen in that case? Tell us what you want.

--
Paige Miller

Registration is open

SAS Training: Just a Click Away