Hello
The raw data set have till 6 rows for each customer (Each row represent information of a month).
It might happen that customer have less than 6 rows.
From list of variables (let's say X,Y,Z) I need to calculate for each variable a new calculated variable (called calc_X,Calc_Y,Calc_Z) .
The logic how to calculate these 3 variables is as following:
1-If customer have less then 4 rows then each of the variables be null (calc_X,Calc_Y,Calc_Z)
2-If customer have at least 4 rows then need to calculate do this calculation for each of the variables (X,Y,Z). calculate CV(STD/AVG)
IF CV<=0.8 then the calculated value equal to average
else IF 0.8<CV<=1.5 then calculated value equal to the value of second observation after sort it from low to high
else IF CV>1.5 then calculated value equal to the value of First observation after sort it from low to high (note- sort by the variable that we want to calculate. For example: if we calculate calc_X then sort by X)
Here is example to raw data set.
Please not the request to calculated : Calc_X,Calc_Y,Calc_Z (and not calculate for R and T)
In real life I have 100 variables to calculate
Data have;
Input custID YYYYMM X Y Z R T;
cards;
1 202206 10 20 30 40 50
1 202207 15 20 25 30 35
1 202208 30 20 15 10 12
1 202209 15 18 21 27 31
1 202210 19 22 40 15 12
1 202211 40 20 30 10 25
2 202210 25 45 35 15 10
2 202211 20 16 18 24 31
;
Run;
What id the recommended way to do it please?
else IF 0.8<CV<=1.5 then calculated value equal to the value of second observation after sort it from low to high
else IF CV>1.5 then calculated value equal to the value of First observation after sort it from low to high
Sort by which variable? The value of what?
Note that your data step does not have a Datalines/Cards statement and hence does not run.
sort by the variable that we want to calculate. For example: if we calculate calc_X then sort by X
Some of the below code is so we don't have to fool around with "100's of variable" names.
I changed some values in the data step so there were some examples of the different ranges of CV to test.
Note that reshaping the data allows sorting into the order and selecting the value you want after adding an ordering variable.
Data have; Input custID YYYYMM X Y Z R T; datalines; 1 202206 100 20 900 40 50 1 202207 15 20 25 30 35 1 202208 30 20 15 10 12 1 202209 15 18 21 27 31 1 202210 19 22 40 15 12 1 202211 40 20 30 10 25 2 202210 25 45 35 15 10 2 202211 20 16 18 24 31 ; Run; proc transpose data=have out=trans (rename=(_name_=Var)) ; by custid; var x y z; run; proc sort data=trans; by custid Var; run; proc transpose data=trans out=trans2; by custid Var; var col: ; run; proc sort data=trans2; by custid Var Col1; run; proc summary data=trans2 nway; class custid var; var col1; output out=summary (drop=_type_ _freq_) cv=CV mean= mean n=n std=std/autoname; run; /* the above cv is the percent, ie std/mean multiplied by 100 */ data need; set summary; cv=cv/100; If n ge 4 then do; if cv le 0.8 then calc=mean; else IF 0.8< CV le 1.5 then use=2; else IF CV>1.5 then use=1; end; else calc =.; run; data order; set trans2 (drop=_name_); by custid var; retain use; if first.var then use=1; else use+1; run; data need2; merge order need (in=inneed) ; by custid var use; if inneed; if use then calc=col1; run; proc transpose data=need2 out=want (drop=_name_) prefix=calc_ ; by custid; id var; var calc; run;
If you have hundreds of thousands of observations the transpose steps may take a long time and hopefully won't run out of memory.
Where did you apply the logic of second smallest value or smallest value ?
You can effectively reshape the data by using a hash object, reducing code to a single DATA step. The hash object H below has only 3 dataitems (with variables _VALUES1 through _VALUES63). One dataitem (think one row) for X. One for Y and one for Z.
Using dataset HAVE as provided by @ballardw, you can do the below.\:
data have;
Input custID YYYYMM X Y Z R T;
datalines;
1 202206 100 20 900 40 50
1 202207 15 20 25 30 35
1 202208 30 20 15 10 12
1 202209 15 18 21 27 31
1 202210 19 22 40 15 12
1 202211 40 20 30 10 25
2 202210 25 45 35 15 10
2 202211 20 16 18 24 31
run;
data want (drop=_: x y z);
set have (keep=custid x y z);
by custid;
array _values{6} ;
length _key 8;
if _n_=1 then do;
declare hash h ();
h.definekey('_key');
h.definedata('_values1','_values2','_values3','_values4','_values5','_values6');
h.definedone();
end;
if first.custid then call missing(_n,of _:);
_n+1;
array xyz {3} x y z;
do _key=1 to 3;
_rc=h.find();
_values{_n}=xyz{_key};
h.replace();
end;
if last.custid;
array calc{*} x_calc y_calc z_calc ;
if _n<4 then call missing(of calc{*});
else do _key=1 to 3;
h.find();
_cv=std(of _values{*})/mean(of _values{*});
if _cv<=0.8 then calc{_key}=mean(of _values{*}); else
if _cv<=1.5 then calc{_key}=smallest(2,of _values{*}); else
calc{_key}=min(of _values{*});
end;
h.clear();
run;
@Ronein wrote:
Please note that if the 6 values are for example : 20,10,20,20,20,10
Then after order it from low to high
10,10,20,20,20,20
Value of 1st is 10
Value of 2nd is 10
Mean is 100/6
Will smallest 2 function that used give 10 or 20?
Why not write some code and test it yourself if you have questions:
data example; result = smallest(2,10,10,20,20,20,20); /* and without ordering the values*/ result2 = smallest(2,20,10,20,20,10,20); run;
Based on anwser of @mkeintz , here is an DOW-Loop + Array version:
data want;
do i=1 by 1 until(last.custid);
set have;
by custid;
array _raw_[*] x y z;
array _val_[3,99]_temporary_;
array _cal_[*] calc_x calc_y calc_z;
do j=1 to dim(_raw_);
_val_[j,i]=_raw_[j];
end;
end;
array _tmp_[99]_temporary_;
if i>4 then do j=1 to dim(_raw_);
do k=1 to i;
_tmp_[k]=_val_[j,k];
end;
cv=cv(of _tmp_[*])/100; *The function CV() multipy the result 100 times by default;
if cv>1.5 then _cal_[j]=min(of _tmp_[*]);
else if cv>0.8 then _cal_[j]=smallest(2,of _tmp_[*]);
else if cv>. then _cal_[j]=mean(of _tmp_[*]);
end;
drop i j k cv;
run;
Array _val_
has two dimension: the first for number of variables, the second for oberservations of different custid
, you can use bigger numbers to process more variables and observations, with not signicicant increasment on time consuming.
Thanks!
What happens if mean=0 ? then CV calculation will get error
@Ronein wrote:
What happens if mean=0 ? then CV calculation will get error
It does not generate an error, it generates a NOTE and the rest of the data set continues to run. You should try it and find out what happens when mean=0.
If what does happen isn't what you want, what do you want to happen in that case? Tell us what you want.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.