I have data corresponding to many participants, collected from them in different locations.
id | South | North | east | west |
1 | 2 | 4 | 12 | 9 |
2 | 3 | . | 11 | 10 |
3 | 2 | 4 | 12 | 10 |
4 | 6 | 4 | 13 | 12 |
Now I need to obtain the mean (or average) of all of them and put it into new columns, as well as the standard deviation.
Bear in mind that both the mean and the standard deviation must consider all the observations; in SQL I don't want to have to divide the sum by the number of observations by hand, that's not elegant LOL.
I would like to obtain this:
id | South | North | mean_south_north | SD_south_north | east | west | mean_east_west | SD_east_west |
1 | 2 | 4 | 3,57142857142857 | 1,39727626201154 | 12 | 9 | 11,125 | 1,35620268186054 |
2 | 3 | . | 3,57142857142857 | 1,39727626201154 | 11 | 10 | 11,125 | 1,35620268186054 |
3 | 2 | 4 | 3,57142857142857 | 1,39727626201154 | 12 | 10 | 11,125 | 1,35620268186054 |
4 | 6 | 4 | 3,57142857142857 | 1,39727626201154 | 13 | 12 | 11,125 | 1,35620268186054 |
The first mean comes from the observations obtained in South and North, and the second from those in east and west, together with their respective standard deviations.
Thanks in advance
Not a SQL solution but using PROC TRANSPOSE and PROC MEANS.
data loc;
input id South North east west;
datalines;
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12
;
proc transpose data = loc out = locLong(rename = (_NAME_ = Location COL1 = Response));
by id;
run;
proc means data = locLong noprint;
var Response;
where Location in ("South", "North");
output out = southNorth(drop = _TYPE_ _FREQ_) mean = stddev = / autoname;
run;
proc means data = locLong noprint;
var Response;
where Location in ("east", "west");
output out = eastWest(drop = _TYPE_ _FREQ_) mean = stddev = / autoname;
run;
data locFinal;
  /* southNorth and eastWest contribute their single row of statistics only on
     the first observation of the merge; the sum statements below (which imply
     RETAIN and ignore missing values) carry those statistics onto every row  */
  retain mean_south_north SD_south_north;
  merge loc
        southNorth(rename = (Response_Mean = meansouth_north Response_StdDev = SDsouth_north))
        eastWest(rename = (Response_Mean = meaneast_west Response_StdDev = SDeast_west));
  mean_south_north + meansouth_north;
  SD_south_north + SDsouth_north;
  mean_east_west + meaneast_west;
  SD_east_west + SDeast_west;
  drop meansouth_north SDsouth_north meaneast_west SDeast_west;
run;
proc print data = locFinal;
var id South North mean_south_north SD_south_north east west mean_east_west SD_east_west;
run;
@jonatan_velarde wrote:
unfortunately I need SQL code, because I will use this code in a database containing more than 500 columns, so thanks for the effort, I'm sure someone will use it.
Please provide as much relevant information as you can upfront:
- What database please?
- What data volume (number of rows in the table)?
- 500 columns in a single table?
- Is there some group id in your source table? I don't expect that you want a single aggregated value for all rows.
- Please try to share representative data
- Please share this data via data step code that generates it so we can spend the time answering your question instead of preparing sample data.
If your database has 500 columns and you need to do this on all 500 (in pairs), SQL is probably the worst tool to choose.
PROC MEANS/PROC SUMMARY will produce the means or sums for all 500 variables, you can then in a subsequent data step combine the pairs of means (weighting properly) and pairs of standard deviations (weighting properly).
Also see In Database Processing for PROC MEANS
UNTESTED CODE
%let all_names = <your 500 variable names>;
proc means data=yourdatabasetable noprint;
var &all_names;
output out=stats sum= n= stddev=/autoname;
run;
proc sql noprint;
select name into :sums separated by ' ' from dictionary.columns where lowcase(name) ? '_sum'
and libname="WORK" and memname="STATS" order by varnum;
select name into :ns separated by ' ' from dictionary.columns where lowcase(name) ? '_n'
and libname="WORK" and memname="STATS" order by varnum;
select name into :stds separated by ' ' from dictionary.columns where lowcase(name) ? '_stddev'
and libname="WORK" and memname="STATS" order by varnum;
quit;
data final;
set stats;
array _s (*) &sums;
array _n (*) &ns;
array std (*) &stds;
array combined_mean (*) mean1-mean500;
array combined_std (*) std1-std500;
do i=1 to dim(_n) by 2;
combined_mean(i)=sum(_s(i),_s(i+1))/sum(_n(i),_n(i+1));
combined_std(i)= /* Formula is at https://www.statology.org/weighted-standard-deviation-excel/ */
/* I'm not going to program it now */ ;
end;
run;
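For what it's worth, here is a hedged sketch of that deferred last step for a single pair, using South and North as stand-ins for one of your variable pairs (the South_Sum / South_N / South_StdDev names are an assumption based on the /autoname convention above). It rebuilds the combined standard deviation from each variable's count, sum, and standard deviation:
data pooled;
  set stats;                                      /* one-row output from the PROC MEANS step above */
  n_all    = sum(South_N, North_N);               /* combined count of non-missing values          */
  mean_all = sum(South_Sum, North_Sum) / n_all;   /* combined mean = total sum / total count       */
  m_s = South_Sum / South_N;                      /* per-variable means recovered from the sums    */
  m_n = North_Sum / North_N;
  /* within-variable corrected sums of squares plus the shift for the differing means */
  css = (South_N - 1)*South_StdDev**2 + South_N*(m_s - mean_all)**2
      + (North_N - 1)*North_StdDev**2 + North_N*(m_n - mean_all)**2;
  sd_all = sqrt(css / (n_all - 1));               /* combined SD over all stacked values           */
run;
Run against the small example table, this should give mean_all = 3.5714286 and sd_all = 1.3972763 for the South/North pair.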
I found a clue:
proc sql;
create table want as
select *,
sum(cost + best) as sum1,
mean(cost best) as mean1,
std(cost + best) as sd1,
sum(west + east) as sum2,
mean(west + east) as mean2,
std(west + east) as sd2
from have;
quit;
but I can't get the correct mean.
Thanks in advance
@jonatan_velarde wrote:
I found a clue:
proc sql;
create table want as
select *,
sum(cost + best) as sum1,
mean(cost best) as mean1,
std(cost + best) as sd1,
sum(west + east) as sum2,
mean(west + east) as mean2,
std(west + east) as sd2
from have;
quit;
but I can't get the correct mean.
Thanks in advance
Your picture of the data doesn't show variables COST or BEST. So why are you including them here?
I do not understand what you want, please describe in much more detail.
On your first observation the value of South is 2 and of North is 4. The mean of 2 and 4 is exactly 3. So why do you have 3.57... as the mean on that observation?
If you want to treat North and South as the same variable then make them the same variable.
First let's convert your LISTING into an actual DATASET.
data have;
input id South North east west ;
cards;
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12
;
One way to make NORTH/SOUTH and EAST/WEST into the same variables is to just stack the two together.
proc sql;
select mean(north_south) as mean_north_south
, mean(east_west) as mean_east_west
from
(select north as north_south
, east as east_west
from have
union all
select south as north_south
, west as east_west
from have
);
quit;
Result:
mean_north_    mean_east_
      south          west
--------------------------
   3.571429        11.125
But if you really have 500 variables, that is going to be extremely painful to code in PROC SQL.
Why not just fix the data structure first. Then it will be easier.
proc transpose data=have out=tall;
by id;
var south -- west ;
run;
proc format ;
value $nsew
'North','north','South','south'='North-South'
'East','east','West','west'='East-West'
;
run;
proc means data=tall mean std;
class _name_;
format _name_ $nsew.;
var col1;
run;
Result
The MEANS Procedure

Analysis Variable : COL1

NAME OF FORMER     N
VARIABLE         Obs          Mean       Std Dev
------------------------------------------------
North-South        8     3.5714286     1.3972763
East-West          8    11.1250000     1.3562027
------------------------------------------------
So your terminology is confused, which is making it hard for you to explain your request.
DATASETS consist of OBSERVATIONS and VARIABLES. Your example dataset has only 4 observations and 5 variables. If you want to take the mean of all of the VALUES in both variables then that is what I showed you how to do.
If you want the count of non-missing values from PROC MEANS, then ask for it. It can also tell you the number of missing values.
proc means data=tall n nmiss mean std;
class _name_;
format _name_ $nsew.;
var col1;
run;
Result
The MEANS Procedure

Analysis Variable : COL1

NAME OF FORMER     N                N
VARIABLE         Obs     N       Miss          Mean       Std Dev
------------------------------------------------------------------
North-South        8     7          1     3.5714286     1.3972763
East-West          8     8          0    11.1250000     1.3562027
------------------------------------------------------------------
Hi Tom:
Thanks for your answer. Yeah, you are perfectly right, four observations for each variable:
data have:
id | South | North | east | west |
1 | 2 | 4 | 12 | 9 |
2 | 3 | . | 11 | 10 |
3 | 2 | 4 | 12 | 10 |
4 | 6 | 4 | 13 | 12 |
Using SQL I would like to obtain:
id | South | North | mean_south_north | SD_south_north | east | west | mean_east_west | SD_east_west |
1 | 2 | 4 | 3,57142857142857 | 1,39727626201154 | 12 | 9 | 11,125 | 1,35620268186054 |
2 | 3 | . | 3,57142857142857 | 1,39727626201154 | 11 | 10 | 11,125 | 1,35620268186054 |
3 | 2 | 4 | 3,57142857142857 | 1,39727626201154 | 12 | 10 | 11,125 | 1,35620268186054 |
4 | 6 | 4 | 3,57142857142857 | 1,39727626201154 | 13 | 12 | 11,125 | 1,35620268186054 |
where mean_south_north contains the arithmetic average, known as the mean, of every observation in the columns South and North, that is, of the values 2, 3, 2, 6, 4, 4, 4; by the same logic, SD_south_north contains the standard deviation of those same values, taken from the 4 observations of the variables South and North.
I already showed you how to do that. You have to put those two "columns" of North and South into one "column" so that you can then use the MEAN() aggregate function (commonly called AVG() in many SQL dialects). If your SQL dialect supports it, you could also use a STD() aggregate function (under whatever name your SQL dialect uses for it).
You could try crafting your own logic for generating statistics.
The mean is just the sum over the count. So that is simple.
(sum(south) + sum(north))/(count(south)+count(north)) as mean_south_north
But the formula for the standard deviation is much more complicated. You would need to remerge the mean back onto every observation, find the difference, etc. etc.
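For completeness, here is a rough sketch of that hard way in PROC SQL for the South/North pair of the HAVE table above: compute the pooled mean in an inline view, cross join it back onto every row, and then accumulate the squared deviations (SUM and COUNT simply skip the missing North value):
proc sql;
  select sqrt( ( sum((south - m)**2) + sum((north - m)**2) )
             / (count(south) + count(north) - 1) ) as sd_south_north
  from have,
       (select (sum(south) + sum(north)) / (count(south) + count(north)) as m
          from have);
quit;
On the sample data this returns 1.3972763, matching the PROC MEANS result shown above.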
Also why the heck do you want the same value repeated onto every observation.
Where are you going with this?
Also how did you end up with that input data structured where you have the values for a single variable split into two variables? It would be much easier in SQL if the data was structured with three variables (ID, DIRECTION, COUNT) and 16 observations instead of 5 variables and 4 observations.
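For the record, with that long structure the whole request collapses into one small grouped query. A sketch, assuming a long table called HAVE_LONG with variables ID, DIRECTION and VALUE (VALUE rather than COUNT, so the column name does not clash with the COUNT() function):
proc sql;
  select case when upcase(direction) in ('SOUTH','NORTH') then 'South-North'
              else 'East-West'
         end as pair,
         mean(value) as mean,
         std(value) as sd
  from have_long
  group by 1;   /* group by the first (calculated) column */
quit;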
Thank you for your correct answer, now I'll answer some of your questions:
Also why the heck do you want the same value repeated onto every observation.
Where are you going with this?
IT IS NOT YOUR BUSINESS!
Also how did you end up with that input data structured where you have the values for a single variable split into two variables?
YES, NOW I CAN WORK WITH MORE THAN 1000 VARIABLES
It would be much easier in SQL if the data was structured with three variables (ID, DIRECTION, COUNT) and 16 observations instead of 5 variables and 4 observations.
IT WOULD BE EASIER TO KEEP THE STRUCTURE OF THE QUESTION, AND MOVE ON!
🙂
data have;
input id South North east west ;
cards;
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12
;
proc sql;
create table want as
select *,
sum(south + north) as sum_south_north,   /* sum of the row totals; rows where either value is missing are skipped */
(sum(south) + sum(north))/(count(south)+count(north)) as mean_south_north,
std(south + north) as sd_south_north,    /* SD of the row totals, not of the stacked South/North values */
sum(west + east) as sum_west_east,
(sum(west) + sum(east))/(count(west)+count(east)) as mean_west_east,
std(west + east) as sd_west_east         /* SD of the row totals, not of the stacked east/west values */
from have;
quit;
proc print data = want;
run;
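If the exact output table from the original post is still the goal, here is a sketch of one way to get it in PROC SQL, along the lines of the UNION ALL approach shown earlier (not tuned for a 500-column table; the table and column names are the ones from HAVE above). Stack each pair of columns, aggregate once, and cross join the two one-row summaries back onto every row of HAVE:
proc sql;
  create table want2 as
  select h.id, h.South, h.North,
         sn.mean_south_north, sn.SD_south_north,
         h.east, h.west,
         ew.mean_east_west, ew.SD_east_west
  from have as h,
       (select mean(v) as mean_south_north, std(v) as SD_south_north
          from (select South as v from have
                union all
                select North as v from have)) as sn,
       (select mean(v) as mean_east_west, std(v) as SD_east_west
          from (select east as v from have
                union all
                select west as v from have)) as ew;
quit;
On the sample data this puts 3.5714286 / 1.3972763 and 11.125 / 1.3562027 on every row, matching the requested output.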