Solved: Find Standard deviation from 2, or more, columns/variables with aim of...

jonatan_velarde · Posted 04-27-2024 08:11 PM

Good night guys:

Last post i tried to explain an example, but for some logistic reasons i couldn't finish the explanation there so here i go again:

i have this data set:

data have;
input id South North east west ;
cards;
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12
;

proc sql;
create table want as
select *,
sum(south + north)/(count(south + north)) as mean_south_north,
std(south and north) as sd_south_north
from have;
quit;

proc print data = want;
run;, but unfortunately i got this:

Obs	id	south	north	west	east	sum_south_north	mean_south_north	sd_south_north	sum_west_east	mean_west_east
1	1	10	.	15	9	116	25.2	8.48528	74	12.3333
2	2	40	12	10	14	116	25.2	8.48528	74	12.3333
3	3	50	14	13	13	116	25.2	8.48528	74	12.3333

The RED BOLD text in the table above represents the standard deviation wrongly found, where the correct value must be 18.4715998224301, as the real value of the standard deviation of 10, 40, 50, 12 and 14. Must be noted that it is a missing value in the data set, so it is very important to consider also.

Other important issue, it would be grate make the program to find the standard deviation from other columns and put them as other variable, for example the standard deviation of west and east observations , as std_west_east, considering GREEN DATA in the table above

i appreciate any help to complete this task correcting the code i provided to you above.

thanks in advance

mkeintz · Posted 04-27-2024 09:15 PM

There is no way your expression

std(south and north) as sd_south_north

can produce 8.4825 (or 18.4715998224301).

That's because the expression "south and north" is a logical expression yielding either a zero or one. So std(south and north) would be getting the STD of a collection of 1's and 0's.

Please provide the code (and data) you used to produce the unexpected results you show.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Patrick · Posted 04-27-2024 08:40 PM

The data in your script that creates table Have doesn't match the desired result you show us in the data grid with the wrong numbers. Can you please fix/align this so we start with the correct data. And also please ensure that the desired data you show us is correct unless otherwise stated - like shouldn't sum_south_north show a value of 126?

You also mentioned in your earlier post that there will be up-to 500 variables for which you want do create such means and std. Can you please already now be specific how the "pattern" will look like. Will there always be pairs of variables or could it be any number of variables that you want to "combine" for such calculations. To be specific about this already now is important as else you might get code proposed that you can't easily "generalize" for all your other cases.

jonatan_velarde · Posted 04-27-2024 09:42 PM

Hi Patrick and thanks for your answers. So now ill explain this example:

The last example explains that i have sets of variables, sometimes with distinct number of columns, and that why i need a basic sql code to get the standar deviation, eht mean was solved already in the last post, but now i need the standard deviation

data have;
input id South North east west ;
cards;
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12
;

proc sql;
create table want as
select *,
sum(south + north)/(count(south + north)) as mean_south_north,
std(south, north) as sd_south_north
from have;
quit;

proc print data = want;
run;

I know that my code isn't efficient enough to get the value of standard deviation from two, or more, columns, that's why i come here to be helped.

Now, my need is very simple, obtain the standar deviation from two or more numeric columns, includiing columns with lost observations, to put this exact value in a whole column to estimate new values and proceed with other tasks.

thanks for the comprehension

jonatan_velarde · Posted 04-27-2024 09:45 PM

this is the base code:

sorry for the mistake

data have;
input id south north west east;
cards;
1 10 . 15 9
2 40 12 10 14
3 50 14 13 13
;

proc sql;
create table want as
select *,
sum(south + north) as sum_south_north,
(sum(south) + sum(north))/(count(south)+count(north)) as mean_south_north,
std(south + north) as sd_south_north,
sum(west + east) as sum_west_east,
(sum(west) + sum(east))/(count(west)+count(east)) as mean_west_east,
std(west + east) as sd_south_north
from have;
quit;

proc print data = want;
run;

Patrick · Posted 04-27-2024 09:58 PM

@jonatan_velarde I hope you realize that your current code returns a wrong result for the SUM of south north. It should be 126. The mean uses the correct sum: 126 - 126/5=25.2

The last example explains that i have sets of variables, sometimes with distinct number of columns

Variables and Columns are two terms for the same!

	Variable/Column
Row/Observation/Record

In your sample data you want the mean and std over two columns/variables treated as if if they were in a single column and multiple rows.

To provide code that's easily extensible for other such cases we really need to know if your "sets of variables" always come in pairs like in the sample data or if there could also be cases with only a single or more than two variables. Can you please clarify.

jonatan_velarde · Posted 04-27-2024 10:03 PM

the data set i have to apply this new code does not have sets of pair of variables, sometimes are 47, other are 14, sometimes are 6. That's why i was asking a flexible code to find these values

Patrick · Posted 04-27-2024 11:07 PM

@jonatan_velarde wrote:
the data set i have to apply this new code does not have sets of pair of variables, sometimes are 47, other are 14, sometimes are 6. That's why i was asking a flexible code to find these values

If you are after flexible code then you need to tell us upfront what sort of flexibility you require and best provide representative sample data that already allows to test for such flexibility.

Below code returns the std number you desire for your sample data and the code should it also make rather easy to add additional sets of variables for the same sort of analysis.

To add additional groups of variables it should be sufficient to define additional groups in the Proc Format bit and also add these additional variables to the array statement under the next data step.

data have;
  input id south north west east other_var;
cards;
1 10 . 15 9 1
2 40 12 10 14 2
3 50 14 13 13 3
;

/* transform source data into a structure better suitable for analysis */
proc format;
  invalue $groups(upcase default=32)
    'SOUTH','NORTH' = 'south_north'
    'WEST','EAST'    = 'west_east'
    other = _same_
    ;
run;

data long(keep=analysis_group val);
  set have;
  array vars{*} south north west east other_var;
  do i=1 to dim(vars);
    analysis_group=input(vname(vars[i]),$groups.);
    val=vars[i];
    output;
  end;
run;

/* analysis */
proc sql;
  create table result_long as
  select 
    analysis_group
    ,sum(val) as sum_
    ,mean(val) as mean_
    ,std(val) as std_
  from long
  group by analysis_group
  ;
quit;

/* transform analysis result data structure for merge to source data structure */
proc datasets lib=work nolist nowarn;
  delete result_wide_:;
quit;

%let aggr=sum_;
proc transpose data=result_long out=result_wide_&aggr(drop=_name_) prefix=&aggr.;
  id analysis_group;
  var &aggr. ;
run;
%let aggr=mean_;
proc transpose data=result_long out=result_wide_&aggr(drop=_name_) prefix=&aggr.;
  id analysis_group;
  var &aggr. ;
run;
%let aggr=std_;
proc transpose data=result_long out=result_wide_&aggr(drop=_name_) prefix=&aggr.;
  id analysis_group;
  var &aggr. ;
run;

data result_wide;
  merge result_wide_:;
run;

/* add analysis results to source data structure */
data want;
  set have;
  if _n_=1 then set result_wide;
run;

proc print data=want;
run;

Patrick · Posted 04-27-2024 11:41 PM

@jonatan_velarde I'm now a bit confused why you accepted the answer that your code won't work as solution but haven't even had the courtesy to react on the solution I proposed which I believe addresses your requirements and returns the result that you defined as correct.

mkeintz · Posted 04-27-2024 09:15 PM

There is no way your expression

std(south and north) as sd_south_north

can produce 8.4825 (or 18.4715998224301).

That's because the expression "south and north" is a logical expression yielding either a zero or one. So std(south and north) would be getting the STD of a collection of 1's and 0's.

Please provide the code (and data) you used to produce the unexpected results you show.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

jonatan_velarde · Posted 04-27-2024 09:55 PM

thank you for your answer:

here is the correct code:

data have;
input id south north west east;
cards;
1 10 . 15 9
2 40 12 10 14
3 50 14 13 13
;

proc sql;
create table want as
select *,
sum(south + north) as sum_south_north,
(sum(south) + sum(north))/(count(south)+count(north)) as mean_south_north,
std(south + north) as sd_south_north,
sum(west + east) as sum_west_east,
(sum(west) + sum(east))/(count(west)+count(east)) as mean_west_east,
std(west + east) as sd_south_north
from have;
quit;

proc print data = want;
run;

mkeintz · Posted 04-27-2024 11:50 PM

The reason your code produces sd_south_north=8.485 (instead of 18.475) is the same as the reason it produces sum_south_north=116 (instead of 126). That's because it is treating north+south as a single calculated value for each row. This means

It excludes row 1, because north is missing, making north+sourth missing.
It provides statistics on the sum calculated for each row, not the individual components of the sum. I.e. the standard deviation is std(40+12,50+14)= std(52,64) = 8.485.

If you really must use sql to statistically treat two columns as if they were one long column, then you will have messy expressions, like

dm 'clear out';
data have;
input id south north west east;
cards;
1 10 . 15 9
2 40 12 10 14
3 50 14 13 13
run;

proc sql;
  create table want as
  select *
   ,sum(south)+sum(north) as sum_south_north
   ,(sum(south)+sum(north))/(count(south)+count(north)) as mean_south_north
   ,sqrt(
         ((sum(north**2) + sum(south**2)) - (sum(north)+sum(south))**2 / (count(north)+count(south)))
         /
         (count(north)+count(south)-1)
        ) as std_south_north
  from have;
quit;

For a couple of column-pairs, this might be tolerable, but if there are lots of pairs, then you are asking for trouble. You're better off reshaping the data and then generating the STD more directly.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 04-28-2024 07:32 PM

You asked this question already and you have been told you need to convert your variables into a single variable.

select south as new_var from have
union all
select north as new_var from have

Then you can calculate mean, std, var, and other statistics on this new variable.

Find Standard deviation from 2, or more, columns/variables with aim of SQL

Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

SAS Innovate 2025: Call for Content

Classroom Training Available!