Lapis Lazuli | Level 10

## Find Standard deviation from 2, or more, columns/variables with aim of SQL

Good night guys:

Last post i tried to explain an example, but for some logistic reasons i couldn't finish the explanation there so here i go again:

i have this data set:

data have;
input id South North east west ;
cards;
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12
;

proc sql;
create table want as
select *,
sum(south + north)/(count(south + north)) as mean_south_north,
std(south and north) as sd_south_north
from have;
quit;

proc print data = want;
run;, but unfortunately i got this:

Obs id south north west east sum_south_north mean_south_north sd_south_north sum_west_east mean_west_east
1 1 10 . 15 9 116 25.2 8.48528 74 12.3333
2 2 40 12 10 14 116 25.2 8.48528 74 12.3333
3 3 50 14 13 13 116 25.2 8.48528 74 12.3333

The RED BOLD text in the table above represents the standard deviation wrongly found, where the correct value must be  18.4715998224301, as the real value of the standard deviation of 10, 40, 50, 12 and 14. Must be noted that it is a missing value in the data set, so it is very important to consider also.

Other important issue, it would be grate make the program to find the standard deviation from other columns and put them as other variable, for example the standard deviation of west and east observations , as std_west_east, considering GREEN DATA in the table above

i appreciate any help to complete this task correcting the code i provided to you above.

1 ACCEPTED SOLUTION

Accepted Solutions
PROC Star

## Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

There is no way your expression

std(south and north) as sd_south_north

can produce 8.4825  (or 18.4715998224301).

That's because the expression "south and north" is a logical expression yielding either a zero or one.  So std(south and north) would be getting the STD of a collection of 1's and 0's.

Please provide the code (and data) you used to produce the unexpected results you show.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
11 REPLIES 11
Opal | Level 21

## Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

The data in your script that creates table Have doesn't match the desired result you show us in the data grid with the wrong numbers. Can you please fix/align this so we start with the correct data. And also please ensure that the desired data you show us is correct unless otherwise stated - like shouldn't sum_south_north show a value of 126?

You also mentioned in your earlier post that there will be up-to 500 variables for which you want do create such means and std. Can you please already now be specific how the "pattern" will look like. Will there always be pairs of variables or could it be any number of variables that you want to "combine" for such calculations. To be specific about this already now is important as else you might get code proposed that you can't easily "generalize" for all your other cases.

Lapis Lazuli | Level 10

## Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

Hi Patrick and thanks for your answers. So now ill explain this example:

The last example explains that i have sets of variables, sometimes with distinct number of columns, and that why i need a basic sql code to get the standar deviation, eht mean was solved already in the last post, but now i need the standard deviation

data have;
input id South North east west ;
cards;
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12
;

proc sql;
create table want as
select *,
sum(south + north)/(count(south + north)) as mean_south_north,
std(south, north) as sd_south_north
from have;
quit;

proc print data = want;
run;

I know that my code isn't efficient enough to get the value of standard deviation from two, or more, columns, that's why i come here to be helped.

Now,  my need is very simple, obtain the standar deviation from two or more numeric columns, includiing columns with lost observations, to put this exact value in a whole column to estimate new values and proceed with other tasks.

thanks for the comprehension

Lapis Lazuli | Level 10

## Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

this is the base code:

sorry for the mistake

data have;
input id south north west east;
cards;
1 10 . 15 9
2 40 12 10 14
3 50 14 13 13
;

proc sql;
create table want as
select *,
sum(south + north) as sum_south_north,
(sum(south) + sum(north))/(count(south)+count(north)) as mean_south_north,
std(south + north) as sd_south_north,
sum(west + east) as sum_west_east,
(sum(west) + sum(east))/(count(west)+count(east)) as mean_west_east,
std(west + east) as sd_south_north
from have;
quit;

proc print data = want;
run;
Opal | Level 21

## Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

@jonatan_velarde I hope you realize that your current code returns a wrong result for the SUM of south north. It should be 126. The mean uses the correct sum:  126 - 126/5=25.2

The last example explains that i have sets of variables, sometimes with distinct number of columns

Variables and Columns are two terms for the same!

 Variable/Column Row/Observation/Record

In your sample data you want the mean and std over two columns/variables treated as if if they were in a single column and multiple rows.

To provide code that's easily extensible for other such cases we really need to know if your "sets of variables" always come in pairs like in the sample data or if there could also be cases with only a single or more than two variables. Can you please clarify.

Lapis Lazuli | Level 10

## Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

the data set i have to apply this new code does not have sets of pair of variables, sometimes are 47, other are 14, sometimes are 6. That's why i was asking a flexible code to find these values

Opal | Level 21

## Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

@jonatan_velarde wrote:
the data set i have to apply this new code does not have sets of pair of variables, sometimes are 47, other are 14, sometimes are 6. That's why i was asking a flexible code to find these values

If you are after flexible code then you need to tell us upfront what sort of flexibility you require and best provide representative sample data that already allows to test for such flexibility.

Below code returns the std number you desire for your sample data and the code should it also make rather easy to add additional sets of variables for the same sort of analysis.

To add additional groups of variables it should be sufficient to define additional groups in the Proc Format bit and also add these additional variables to the array statement under the next data step.

data have;
input id south north west east other_var;
cards;
1 10 . 15 9 1
2 40 12 10 14 2
3 50 14 13 13 3
;

/* transform source data into a structure better suitable for analysis */
proc format;
invalue \$groups(upcase default=32)
'SOUTH','NORTH' = 'south_north'
'WEST','EAST'    = 'west_east'
other = _same_
;
run;

data long(keep=analysis_group val);
set have;
array vars{*} south north west east other_var;
do i=1 to dim(vars);
analysis_group=input(vname(vars[i]),\$groups.);
val=vars[i];
output;
end;
run;

/* analysis */
proc sql;
create table result_long as
select
analysis_group
,sum(val) as sum_
,mean(val) as mean_
,std(val) as std_
from long
group by analysis_group
;
quit;

/* transform analysis result data structure for merge to source data structure */
proc datasets lib=work nolist nowarn;
delete result_wide_:;
quit;

%let aggr=sum_;
proc transpose data=result_long out=result_wide_&aggr(drop=_name_) prefix=&aggr.;
id analysis_group;
var &aggr. ;
run;
%let aggr=mean_;
proc transpose data=result_long out=result_wide_&aggr(drop=_name_) prefix=&aggr.;
id analysis_group;
var &aggr. ;
run;
%let aggr=std_;
proc transpose data=result_long out=result_wide_&aggr(drop=_name_) prefix=&aggr.;
id analysis_group;
var &aggr. ;
run;

data result_wide;
merge result_wide_:;
run;

/* add analysis results to source data structure */
data want;
set have;
if _n_=1 then set result_wide;
run;

proc print data=want;
run;

Opal | Level 21

## Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

@jonatan_velarde I'm now a bit confused why you accepted the answer that your code won't work as solution but haven't even had the courtesy to react on the solution I proposed which I believe addresses your requirements and returns the result that you defined as correct.

PROC Star

## Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

There is no way your expression

std(south and north) as sd_south_north

can produce 8.4825  (or 18.4715998224301).

That's because the expression "south and north" is a logical expression yielding either a zero or one.  So std(south and north) would be getting the STD of a collection of 1's and 0's.

Please provide the code (and data) you used to produce the unexpected results you show.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Lapis Lazuli | Level 10

## Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

here is the correct code:

data have;
input id south north west east;
cards;
1 10 . 15 9
2 40 12 10 14
3 50 14 13 13
;

proc sql;
create table want as
select *,
sum(south + north) as sum_south_north,
(sum(south) + sum(north))/(count(south)+count(north)) as mean_south_north,
std(south + north) as sd_south_north,
sum(west + east) as sum_west_east,
(sum(west) + sum(east))/(count(west)+count(east)) as mean_west_east,
std(west + east) as sd_south_north
from have;
quit;

proc print data = want;
run;
PROC Star

## Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

The reason your code produces sd_south_north=8.485 (instead of 18.475) is the same as the reason it produces sum_south_north=116 (instead of 126).  That's because it is treating north+south as a single calculated value for each row.  This means

1. It excludes row 1, because north is missing, making north+sourth missing.
2. It provides statistics on the sum calculated for each row, not the individual components of the sum.   I.e. the standard deviation is   std(40+12,50+14)=  std(52,64) = 8.485.

If you really must use sql to statistically treat two columns as if they were one long column, then you will have messy expressions, like

dm 'clear out';
data have;
input id south north west east;
cards;
1 10 . 15 9
2 40 12 10 14
3 50 14 13 13
run;

proc sql;
create table want as
select *
,sum(south)+sum(north) as sum_south_north
,(sum(south)+sum(north))/(count(south)+count(north)) as mean_south_north
,sqrt(
((sum(north**2) + sum(south**2)) - (sum(north)+sum(south))**2 / (count(north)+count(south)))
/
(count(north)+count(south)-1)
) as std_south_north
from have;
quit;

For a couple of column-pairs, this might be tolerable, but if there are lots of pairs, then you are asking for trouble.   You're better off reshaping the data and then generating the STD more directly.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Super User

## Re: Find Standard deviation from 2, or more, columns/variables with aim of SQL

You asked this question already and you have been told you need to convert your variables into a single variable.

select south as new_var from have
union all
select north as new_var from have

Then you can calculate mean, std, var, and other statistics on this new variable.

Discussion stats
• 11 replies
• 711 views
• 0 likes
• 4 in conversation