Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Programming
- /
- Programming
- /
- calculate multiple activities for some variables

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 06-13-2024 03:19 PM
(767 views)

Hello

The raw data set have till 6 rows for each customer (Each row represent information of a month).

It might happen that customer have less than 6 rows.

From list of variables (let's say X,Y,Z) I need to calculate for each variable a new calculated variable (called calc_X,Calc_Y,Calc_Z) .

The logic how to calculate these 3 variables is as following:

1-If customer have less then 4 rows then each of the variables be null (calc_X,Calc_Y,Calc_Z)

2-If customer have at least 4 rows then need to calculate do this calculation for each of the variables (X,Y,Z). calculate CV(STD/AVG)

IF CV<=0.8 then the calculated value equal to average

else IF 0.8<CV<=1.5 then calculated value equal to the value of second observation after sort it from low to high

else IF CV>1.5 then calculated value equal to the value of First observation after sort it from low to high (note- sort by the variable that we want to calculate. For example: if we calculate calc_X then sort by X)

Here is example to raw data set.

Please not the request to calculated : Calc_X,Calc_Y,Calc_Z (and not calculate for R and T)

In real life I have 100 variables to calculate

```
Data have;
Input custID YYYYMM X Y Z R T;
```

cards;
1 202206 10 20 30 40 50
1 202207 15 20 25 30 35
1 202208 30 20 15 10 12
1 202209 15 18 21 27 31
1 202210 19 22 40 15 12
1 202211 40 20 30 10 25
2 202210 25 45 35 15 10
2 202211 20 16 18 24 31
;
Run;

What id the recommended way to do it please?

13 REPLIES 13

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

else IF 0.8<CV<=1.5 then calculated value equal to the value of second observation after sort it from low to high

else IF CV>1.5 then calculated value equal to the value of First observation after sort it from low to high

Sort by which variable? The value of what?

Note that your data step does not have a Datalines/Cards statement and hence does not run.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

sort by the variable that we want to calculate. For example: if we calculate calc_X then sort by X

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Some of the below code is so we don't have to fool around with "100's of variable" names.

I changed some values in the data step so there were some examples of the different ranges of CV to test.

Note that reshaping the data allows sorting into the order and selecting the value you want after adding an ordering variable.

Data have; Input custID YYYYMM X Y Z R T; datalines; 1 202206 100 20 900 40 50 1 202207 15 20 25 30 35 1 202208 30 20 15 10 12 1 202209 15 18 21 27 31 1 202210 19 22 40 15 12 1 202211 40 20 30 10 25 2 202210 25 45 35 15 10 2 202211 20 16 18 24 31 ; Run; proc transpose data=have out=trans (rename=(_name_=Var)) ; by custid; var x y z; run; proc sort data=trans; by custid Var; run; proc transpose data=trans out=trans2; by custid Var; var col: ; run; proc sort data=trans2; by custid Var Col1; run; proc summary data=trans2 nway; class custid var; var col1; output out=summary (drop=_type_ _freq_) cv=CV mean= mean n=n std=std/autoname; run; /* the above cv is the percent, ie std/mean multiplied by 100 */ data need; set summary; cv=cv/100; If n ge 4 then do; if cv le 0.8 then calc=mean; else IF 0.8< CV le 1.5 then use=2; else IF CV>1.5 then use=1; end; else calc =.; run; data order; set trans2 (drop=_name_); by custid var; retain use; if first.var then use=1; else use+1; run; data need2; merge order need (in=inneed) ; by custid var use; if inneed; if use then calc=col1; run; proc transpose data=need2 out=want (drop=_name_) prefix=calc_ ; by custid; id var; var calc; run;

If you have hundreds of thousands of observations the transpose steps may take a long time and hopefully won't run out of memory.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

In real there are 12 million rows with 50 columns to calculate

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Where did you apply the logic of second smallest value or smallest value ?

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

You can effectively reshape the data by using a hash object, reducing code to a single DATA step. The hash object H below has only 3 dataitems (with variables _VALUES1 through _VALUES6~~3~~). One dataitem (think one row) for X. One for Y and one for Z.

Using dataset HAVE as provided by @ballardw, you can do the below.\:

```
data have;
Input custID YYYYMM X Y Z R T;
datalines;
1 202206 100 20 900 40 50
1 202207 15 20 25 30 35
1 202208 30 20 15 10 12
1 202209 15 18 21 27 31
1 202210 19 22 40 15 12
1 202211 40 20 30 10 25
2 202210 25 45 35 15 10
2 202211 20 16 18 24 31
run;
data want (drop=_: x y z);
set have (keep=custid x y z);
by custid;
array _values{6} ;
length _key 8;
if _n_=1 then do;
declare hash h ();
h.definekey('_key');
h.definedata('_values1','_values2','_values3','_values4','_values5','_values6');
h.definedone();
end;
if first.custid then call missing(_n,of _:);
_n+1;
array xyz {3} x y z;
do _key=1 to 3;
_rc=h.find();
_values{_n}=xyz{_key};
h.replace();
end;
if last.custid;
array calc{*} x_calc y_calc z_calc ;
if _n<4 then call missing(of calc{*});
else do _key=1 to 3;
h.find();
_cv=std(of _values{*})/mean(of _values{*});
if _cv<=0.8 then calc{_key}=mean(of _values{*}); else
if _cv<=1.5 then calc{_key}=smallest(2,of _values{*}); else
calc{_key}=min(of _values{*});
end;
h.clear();
run;
```

--------------------------

The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for

Allow PROC SORT to output multiple datasets

--------------------------

The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for

Allow PROC SORT to output multiple datasets

--------------------------

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Please note that if the 6 values are for example : 20,10,20,20,20,10

Then after order it from low to high

10,10,20,20,20,20

Value of 1st is 10

Value of 2nd is 10

Mean is 100/6

Will smallest 2 function that used give 10 or 20?

Then after order it from low to high

10,10,20,20,20,20

Value of 1st is 10

Value of 2nd is 10

Mean is 100/6

Will smallest 2 function that used give 10 or 20?

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Ronein wrote:

Please note that if the 6 values are for example : 20,10,20,20,20,10

Then after order it from low to high

10,10,20,20,20,20

Value of 1st is 10

Value of 2nd is 10

Mean is 100/6

Will smallest 2 function that used give 10 or 20?

Why not write some code and test it yourself if you have questions:

data example; result = smallest(2,10,10,20,20,20,20); /* and without ordering the values*/ result2 = smallest(2,20,10,20,20,10,20); run;

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Based on anwser of @mkeintz , here is an DOW-Loop + Array version:

```
data want;
do i=1 by 1 until(last.custid);
set have;
by custid;
array _raw_[*] x y z;
array _val_[3,99]_temporary_;
array _cal_[*] calc_x calc_y calc_z;
do j=1 to dim(_raw_);
_val_[j,i]=_raw_[j];
end;
end;
array _tmp_[99]_temporary_;
if i>4 then do j=1 to dim(_raw_);
do k=1 to i;
_tmp_[k]=_val_[j,k];
end;
cv=cv(of _tmp_[*])/100; *The function CV() multipy the result 100 times by default;
if cv>1.5 then _cal_[j]=min(of _tmp_[*]);
else if cv>0.8 then _cal_[j]=smallest(2,of _tmp_[*]);
else if cv>. then _cal_[j]=mean(of _tmp_[*]);
end;
drop i j k cv;
run;
```

Array `_val_`

has two dimension: the first for number of variables, the second for oberservations of different `custid`

, you can use bigger numbers to process more variables and observations, with not signicicant increasment on time consuming.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

If I have 20 variables and 2 million customers ( let's say that each customer has max 6 rows but can have less ). What will be written in _val_[3,99]?

I understand that instead of 3 need to write 20.but what about 99? What should write there ?

I understand that instead of 3 need to write 20.but what about 99? What should write there ?

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

It can be _val_[20,6]. This array is designed to collect all values of specified variables, so the second dimension could have a big number like 99. But if you can sure that each customer has max 6 rows, 6 is enough to be the second dimension.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thanks!

What happens if mean=0 ? then CV calculation will get error

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Ronein wrote:

What happens if mean=0 ? then CV calculation will get error

It does not generate an error, it generates a NOTE and the rest of the data set continues to run. You should try it and find out what happens when mean=0.

If what does happen isn't what you want, what do you want to happen in that case? Tell us what you want.

--

Paige Miller

Paige Miller

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. **Registration is now open through August 30th**. Visit the SAS Hackathon homepage.

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.