Good night guys:
Last post i tried to explain an example, but for some logistic reasons i couldn't finish the explanation there so here i go again:
i have this data set:
data have;
input id South North east west ;
cards;
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12
;
proc sql;
create table want as
select *,
sum(south + north)/(count(south + north)) as mean_south_north,
std(south and north) as sd_south_north
from have;
quit;
proc print data = want;
run;, but unfortunately i got this:
Obs | id | south | north | west | east | sum_south_north | mean_south_north | sd_south_north | sum_west_east | mean_west_east |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 10 | . | 15 | 9 | 116 | 25.2 | 8.48528 | 74 | 12.3333 |
2 | 2 | 40 | 12 | 10 | 14 | 116 | 25.2 | 8.48528 | 74 | 12.3333 |
3 | 3 | 50 | 14 | 13 | 13 | 116 | 25.2 | 8.48528 | 74 | 12.3333 |
The RED BOLD text in the table above represents the standard deviation wrongly found, where the correct value must be 18.4715998224301, as the real value of the standard deviation of 10, 40, 50, 12 and 14. Must be noted that it is a missing value in the data set, so it is very important to consider also.
Other important issue, it would be grate make the program to find the standard deviation from other columns and put them as other variable, for example the standard deviation of west and east observations , as std_west_east, considering GREEN DATA in the table above
i appreciate any help to complete this task correcting the code i provided to you above.
thanks in advance
There is no way your expression
std(south and north) as sd_south_north
can produce 8.4825 (or 18.4715998224301).
That's because the expression "south and north" is a logical expression yielding either a zero or one. So std(south and north) would be getting the STD of a collection of 1's and 0's.
Please provide the code (and data) you used to produce the unexpected results you show.
The data in your script that creates table Have doesn't match the desired result you show us in the data grid with the wrong numbers. Can you please fix/align this so we start with the correct data. And also please ensure that the desired data you show us is correct unless otherwise stated - like shouldn't sum_south_north show a value of 126?
You also mentioned in your earlier post that there will be up-to 500 variables for which you want do create such means and std. Can you please already now be specific how the "pattern" will look like. Will there always be pairs of variables or could it be any number of variables that you want to "combine" for such calculations. To be specific about this already now is important as else you might get code proposed that you can't easily "generalize" for all your other cases.
Hi Patrick and thanks for your answers. So now ill explain this example:
The last example explains that i have sets of variables, sometimes with distinct number of columns, and that why i need a basic sql code to get the standar deviation, eht mean was solved already in the last post, but now i need the standard deviation
data have;
input id South North east west ;
cards;
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12
;
proc sql;
create table want as
select *,
sum(south + north)/(count(south + north)) as mean_south_north,
std(south, north) as sd_south_north
from have;
quit;
proc print data = want;
run;
I know that my code isn't efficient enough to get the value of standard deviation from two, or more, columns, that's why i come here to be helped.
Now, my need is very simple, obtain the standar deviation from two or more numeric columns, includiing columns with lost observations, to put this exact value in a whole column to estimate new values and proceed with other tasks.
thanks for the comprehension
@jonatan_velarde I hope you realize that your current code returns a wrong result for the SUM of south north. It should be 126. The mean uses the correct sum: 126 - 126/5=25.2
The last example explains that i have sets of variables, sometimes with distinct number of columns
Variables and Columns are two terms for the same!
Variable/Column | |
Row/Observation/Record |
In your sample data you want the mean and std over two columns/variables treated as if if they were in a single column and multiple rows.
To provide code that's easily extensible for other such cases we really need to know if your "sets of variables" always come in pairs like in the sample data or if there could also be cases with only a single or more than two variables. Can you please clarify.
@jonatan_velarde wrote:
the data set i have to apply this new code does not have sets of pair of variables, sometimes are 47, other are 14, sometimes are 6. That's why i was asking a flexible code to find these values
If you are after flexible code then you need to tell us upfront what sort of flexibility you require and best provide representative sample data that already allows to test for such flexibility.
Below code returns the std number you desire for your sample data and the code should it also make rather easy to add additional sets of variables for the same sort of analysis.
To add additional groups of variables it should be sufficient to define additional groups in the Proc Format bit and also add these additional variables to the array statement under the next data step.
data have;
input id south north west east other_var;
cards;
1 10 . 15 9 1
2 40 12 10 14 2
3 50 14 13 13 3
;
/* transform source data into a structure better suitable for analysis */
proc format;
invalue $groups(upcase default=32)
'SOUTH','NORTH' = 'south_north'
'WEST','EAST' = 'west_east'
other = _same_
;
run;
data long(keep=analysis_group val);
set have;
array vars{*} south north west east other_var;
do i=1 to dim(vars);
analysis_group=input(vname(vars[i]),$groups.);
val=vars[i];
output;
end;
run;
/* analysis */
proc sql;
create table result_long as
select
analysis_group
,sum(val) as sum_
,mean(val) as mean_
,std(val) as std_
from long
group by analysis_group
;
quit;
/* transform analysis result data structure for merge to source data structure */
proc datasets lib=work nolist nowarn;
delete result_wide_:;
quit;
%let aggr=sum_;
proc transpose data=result_long out=result_wide_&aggr(drop=_name_) prefix=&aggr.;
id analysis_group;
var &aggr. ;
run;
%let aggr=mean_;
proc transpose data=result_long out=result_wide_&aggr(drop=_name_) prefix=&aggr.;
id analysis_group;
var &aggr. ;
run;
%let aggr=std_;
proc transpose data=result_long out=result_wide_&aggr(drop=_name_) prefix=&aggr.;
id analysis_group;
var &aggr. ;
run;
data result_wide;
merge result_wide_:;
run;
/* add analysis results to source data structure */
data want;
set have;
if _n_=1 then set result_wide;
run;
proc print data=want;
run;
@jonatan_velarde I'm now a bit confused why you accepted the answer that your code won't work as solution but haven't even had the courtesy to react on the solution I proposed which I believe addresses your requirements and returns the result that you defined as correct.
There is no way your expression
std(south and north) as sd_south_north
can produce 8.4825 (or 18.4715998224301).
That's because the expression "south and north" is a logical expression yielding either a zero or one. So std(south and north) would be getting the STD of a collection of 1's and 0's.
Please provide the code (and data) you used to produce the unexpected results you show.
The reason your code produces sd_south_north=8.485 (instead of 18.475) is the same as the reason it produces sum_south_north=116 (instead of 126). That's because it is treating north+south as a single calculated value for each row. This means
If you really must use sql to statistically treat two columns as if they were one long column, then you will have messy expressions, like
dm 'clear out';
data have;
input id south north west east;
cards;
1 10 . 15 9
2 40 12 10 14
3 50 14 13 13
run;
proc sql;
create table want as
select *
,sum(south)+sum(north) as sum_south_north
,(sum(south)+sum(north))/(count(south)+count(north)) as mean_south_north
,sqrt(
((sum(north**2) + sum(south**2)) - (sum(north)+sum(south))**2 / (count(north)+count(south)))
/
(count(north)+count(south)-1)
) as std_south_north
from have;
quit;
For a couple of column-pairs, this might be tolerable, but if there are lots of pairs, then you are asking for trouble. You're better off reshaping the data and then generating the STD more directly.
You asked this question already and you have been told you need to convert your variables into a single variable.
select south as new_var from have
union all
select north as new_var from have
Then you can calculate mean, std, var, and other statistics on this new variable.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.