- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
i have this data corresponding to many participants, and were collected data from them in different locations.
id | South | North | east | west |
1 | 2 | 4 | 12 | 9 |
2 | 3 | . | 11 | 10 |
3 | 2 | 4 | 12 | 10 |
4 | 6 | 4 | 13 | 12 |
now i need to obtain the mean ( or average) of all of them and put it into columns, as well as standard deviation.
Repair that both mean and standard deviation are considering all observations, in sql i don't want to divide the sum by the number of the observations, is not fashion LOL.
i would like to obtain this:
id | South | North | mean_south_north | SD_south_north | east | west | mean_east_west | SD_east_west |
1 | 2 | 4 | 3,57142857142857 | 1,39727626201154 | 12 | 9 | 11,125 | 1,35620268186054 |
2 | 3 | . | 3,57142857142857 | 1,39727626201154 | 11 | 10 | 11,125 | 1,35620268186054 |
3 | 2 | 4 | 3,57142857142857 | 1,39727626201154 | 12 | 10 | 11,125 | 1,35620268186054 |
4 | 6 | 4 | 3,57142857142857 | 1,39727626201154 | 13 | 12 | 11,125 | 1,35620268186054 |
the first mean comes from the observations obtained in south and north, and the second, on east and west. as well their respective standard deviations.
Thanks in advance
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I already showed you how to do that. You have to put those two "columns" of North and South into one "column" so that you can then use the MEAN() (or as it is commonly called in many SQL dialects AVERAGE()) aggregate function. If your SQL dialect supports it you could also use a STD() aggregate function (by whatever name your SQL dialect chooses to use for it).
You could try crafting your own logic for generating statistics.
The mean is just the sum over the count. So that is simple.
(sum(south) + sum(north))/(count(south)+count(north)) as mean_south_north
But the formula for the standard deviation is much more complicated. You would need to remerge the mean back onto every observation, find the difference, etc. etc.
Also why the heck do you want the same value repeated onto every observation.
Where are you going with this?
Also how did you end up with that input data structured where you have the values for a single variable split into two variables? It would be much easier in SQL if the data was structured with three variables (ID, DIRECTION, COUNT) and 16 observations instead of 5 variables and 4 observations.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Not a SQL solution but using PROC TRANSPOSE and PROC MEANS.
data loc;
input id South North east west;
datalines;
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12
;
proc transpose data = loc out = locLong(rename = (_NAME_ = Location COL1 = Response));
by id;
run;
proc means data = locLong noprint;
var Response;
where Location in ("South", "North");
output out = southNorth(drop = _TYPE_ _FREQ_) mean = stddev = / autoname;
run;
proc means data = locLong noprint;
var Response;
where Location in ("east", "west");
output out = eastWest(drop = _TYPE_ _FREQ_) mean = stddev = / autoname;
run;
data locFinal;
retain mean_south_north SD_south_north;
merge loc southNorth(rename = (Response_Mean = meansouth_north Response_StdDev = SDsouth_north)) eastWest(rename = (Response_Mean = meaneast_west Response_StdDev = SDeast_west));
mean_south_north + meansouth_north;
SD_south_north + SDsouth_north;
mean_east_west + meaneast_west;
SD_east_west + SDeast_west;
drop meansouth_north SDsouth_north meaneast_west SDeast_west;
run;
proc print data = locFinal;
var id South North mean_south_north SD_south_north east west mean_east_west SD_east_west;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@jonatan_velarde wrote:
unfortunately i need sql code, because i will use this code in a database containing more that 500 columns, so thanks for the effort, I'm sure someone will use it..
Please provide as much as you can all relevant information upfront.
- What database please?
- What data volume (number of rows in the table)?
- 500 columns in a single table?
- Is there some group id in your source table? I don't expect that you want a single aggregated value for all rows.
- Please try to share representative data
- Please share this data via data step code that generates it so we can spend the time answering your question instead of preparing sample data.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Only need sql because is resumed and efficient.
The database i posted as example will let me do the job for the other. that's why i showed you this here.
Thank you
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If your database has 500 columns and you need to do this on all 500 (in pairs), SQL is probably the worst tool to choose.
PROC MEANS/PROC SUMMARY will produce the means or sums for all 500 variables, you can then in a subsequent data step combine the pairs of means (weighting properly) and pairs of standard deviations (weighting properly).
Also see In Database Processing for PROC MEANS
UNTESTED CODE
%let all_names = <your 500 variable names>;
proc means data=yourdatabasetable noprint;
var &all_names;
output out=stats sum= n= stddev=/autoname;
run;
proc sql noprint;
select cats(name,'_sum') into :sums separated by ' ' from dictionary.columns where lowcase(name) ? '_sum'
and libname="WORK" and memname="STATS" order by varnum;
select cats(name,'n') into :ns separated by ' ' from dictionary.columns where lowcase(name) ? '_n'
and libname="WORK" and memname="STATS" order by varnum;
select cats(name,'stddev') into :stds separated by ' ' from dictionary.columns where lowcase(name) ? '_stddev'
and libname="WORK" and memname="STATS" order by varnum;
quit;
data final;
set stats;
array _s &sums;
array _n &ns;
array std &stds;
array combined_mean mean1-mean500;
array combined_std std1-std500;
do i=1 to dim(_n) by 2;
combined_mean(i)=sum(_s(i),s(i+1))/sum(_n(i),_n(i+1));
combined_std(i)= /* Formula is at https://www.statology.org/weighted-standard-deviation-excel/ */
/* I'm not going to program it now */ ;
end;
run;
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
i found A clue:
proc sql;
create table want as
select *,
sum(cost + best) as sum1,
mean(cost best) as mean1,
std(cost + best) as sd1,
sum(west + east) as sum2,
mean(west + east) as mean2,
std(west + east) as sd2
from have;
quit;
but i cant get the correct mean.
thanks in advance
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@jonatan_velarde wrote:
i found A clue:
proc sql;
create table want as
select *,
sum(cost + best) as sum1,
mean(cost best) as mean1,
std(cost + best) as sd1,
sum(west + east) as sum2,
mean(west + east) as mean2,
std(west + east) as sd2
from have;
quit;
but i cant get the correct mean.
thanks in advance
Your picture of the data doesn't show variables COST or BEST. So why are you including them here?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I do not understand what you want, please describe in much more detail.
On your first observation the value of South is 2 and of North is 4. The mean of 2 and 4 is exactly 3. So why do you have 3.57... as the mean on that observation?
If you want to treat North and South as the same variable then make them the same variable.
First let's convert your LISTING into an actual DATASET.
data have;
input id South North east west ;
cards;
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12
;
One way to make NORTH/SOUTH and EAST/WEST into the same variables is to just stack the two together.
proc sql;
select mean(north_south) as mean_north_south
, mean(east_west) as mean_east_west
from
(select north as north_south
, east as east_west
from have
union all
select south as north_south
, west as east_west
from have
);
quit;
Result:
mean_north_ mean_east_ south west -------------------------- 3.571429 11.125
But if you really have 500 variables that is going to extremely painful to code in PROC SQL.
Why not just fix the data structure first. Then it will be easier.
proc transpose data=have out=tall;
by id;
var south -- west ;
run;
proc format ;
value $nsew
'North','north','South','south'='North-South'
'East','east','West','west'='East-West'
;
run;
proc means data=tall mean std;
class _name_;
format _name_ $nsew.;
var col1;
run;
Result
The MEANS Procedure Analysis Variable : COL1 NAME OF FORMER N VARIABLE Obs Mean Std Dev -------------------------------------------------- North-South 8 3.5714286 1.3972763 East-West 8 11.1250000 1.3562027 --------------------------------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Analysis Variable : COL1
NAME OF
FORMER N
VARIABLE Obs Mean Std Dev
--------------------------------------------------
North-South "7" 3.5714286 1.3972763
East-West 8 11.1250000 1.3562027
--------------------------------------------------
There are only seven information for the first mean.
It is needed to find the whole mean, from all the observations contained in both columns
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
So your terminology is confused, which is making it hard for you to explain your request.
DATASETS consist of OBSERVATIONS and VARIABLES. Your example dataset has only 4 observations and 5 variables. If you want to take the mean of all of the VALUES in both variables then that is what I showed you how to do.
If you want the count of the number of non missing values from PROC MEANS then ask for it. It can also tell you the number of missing values.
proc means data=tall n nmiss mean std;
class _name_;
format _name_ $nsew.;
var col1;
run;
Result
The MEANS Procedure Analysis Variable : COL1 NAME OF FORMER N N VARIABLE Obs N Miss Mean Std Dev ---------------------------------------------------------------- North-South 8 7 1 3.5714286 1.3972763 East-West 8 8 0 11.1250000 1.3562027 ----------------------------------------------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Tom:
Thanks for your answer. Yeah, you are perfectly right, four observations for each variable:
data have:
id | South | North | east | west |
1 | 2 | 4 | 12 | 9 |
2 | 3 | . | 11 | 10 |
3 | 2 | 4 | 12 | 10 |
4 | 6 | 4 | 13 | 12 |
using sql i would like to obtain:
id | South | North | mean_south_north | SD_south_north | east | west | mean_east_west | SD_east_west |
1 | 2 | 4 | 3,57142857142857 | 1,39727626201154 | 12 | 9 | 11,125 | 1,35620268186054 |
2 | 3 | . | 3,57142857142857 | 1,39727626201154 | 11 | 10 | 11,125 | 1,35620268186054 |
3 | 2 | 4 | 3,57142857142857 | 1,39727626201154 | 12 | 10 | 11,125 | 1,35620268186054 |
4 | 6 | 4 | 3,57142857142857 | 1,39727626201154 | 13 | 12 | 11,125 | 1,35620268186054 |
where mean_south_north contains the arithmetic average, known as mean, from every observation in the columns South and North, in BLUE BOLD, as follows: 2, 3, 2, 6, 4, 4, 4; using the same thinking, SD_south_north contains the standard deviation from BLUE BOLD data, from the 4 observations and variables South and North.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I already showed you how to do that. You have to put those two "columns" of North and South into one "column" so that you can then use the MEAN() (or as it is commonly called in many SQL dialects AVERAGE()) aggregate function. If your SQL dialect supports it you could also use a STD() aggregate function (by whatever name your SQL dialect chooses to use for it).
You could try crafting your own logic for generating statistics.
The mean is just the sum over the count. So that is simple.
(sum(south) + sum(north))/(count(south)+count(north)) as mean_south_north
But the formula for the standard deviation is much more complicated. You would need to remerge the mean back onto every observation, find the difference, etc. etc.
Also why the heck do you want the same value repeated onto every observation.
Where are you going with this?
Also how did you end up with that input data structured where you have the values for a single variable split into two variables? It would be much easier in SQL if the data was structured with three variables (ID, DIRECTION, COUNT) and 16 observations instead of 5 variables and 4 observations.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you for your correct answer, now ill answer you some questions:
Also why the heck do you want the same value repeated onto every observation.
Where are you going with this?
IT IS NOT YOUR BUSINESS!,
Also how did you end up with that input data structured where you have the values for a single variable split into two variables?
YES, NOW I CAN WORK WITH MORE THAN 1000 VARIABLES
It would be much easier in SQL if the data was structured with three variables (ID, DIRECTION, COUNT) and 16 observations instead of 5 variables and 4 observations.
IT WOULD BE EASIER TO KEEP THE STRUCTURE OF QUE QUESTION, AND MOVE ON!
🙂
data have;
input id South North east west ;
cards;
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12
;
proc sql;
create table want as
select *,
sum(south + north) as sum_south_north,
(sum(south) + sum(north))/(count(south)+count(north)) as mean_south_north,
std(south + north) as sd_south_north,
sum(west + east) as sum_west_east,
(sum(west) + sum(east))/(count(west)+count(east)) as mean_west_east,
std(west + east) as sd_south_north
from have;
quit;
proc print data = want;
run;