I have data corresponding to many participants, collected from them in different locations.
id | South | North | east | west |
1 | 2 | 4 | 12 | 9 |
2 | 3 | . | 11 | 10 |
3 | 2 | 4 | 12 | 10 |
4 | 6 | 4 | 13 | 12 |
Now I need to obtain the mean (or average) of all of them and put it into new columns, as well as the standard deviation.
Bear in mind that both the mean and the standard deviation must consider all the observations; in SQL I don't want to have to divide the sum by the number of observations by hand, that's not elegant LOL.
I would like to obtain this:
id | South | North | mean_south_north | SD_south_north | east | west | mean_east_west | SD_east_west |
1 | 2 | 4 | 3,57142857142857 | 1,39727626201154 | 12 | 9 | 11,125 | 1,35620268186054 |
2 | 3 | . | 3,57142857142857 | 1,39727626201154 | 11 | 10 | 11,125 | 1,35620268186054 |
3 | 2 | 4 | 3,57142857142857 | 1,39727626201154 | 12 | 10 | 11,125 | 1,35620268186054 |
4 | 6 | 4 | 3,57142857142857 | 1,39727626201154 | 13 | 12 | 11,125 | 1,35620268186054 |
The first mean comes from the observations obtained in South and North, and the second from those in east and west, together with their respective standard deviations.
Thanks in advance
Not a SQL solution but using PROC TRANSPOSE and PROC MEANS.
data loc;
input id South North east west;
datalines;
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12
;
proc transpose data = loc out = locLong(rename = (_NAME_ = Location COL1 = Response));
by id;
run;
proc means data = locLong noprint;
var Response;
where Location in ("South", "North");
output out = southNorth(drop = _TYPE_ _FREQ_) mean = stddev = / autoname;
run;
proc means data = locLong noprint;
var Response;
where Location in ("east", "west");
output out = eastWest(drop = _TYPE_ _FREQ_) mean = stddev = / autoname;
run;
data locFinal;
  /* southNorth and eastWest contribute their single row of statistics only on
     the first observation of the merge; the sum statements below (which imply
     RETAIN and ignore missing values) carry those statistics onto every row  */
  retain mean_south_north SD_south_north;
  merge loc
        southNorth(rename = (Response_Mean = meansouth_north Response_StdDev = SDsouth_north))
        eastWest(rename = (Response_Mean = meaneast_west Response_StdDev = SDeast_west));
  mean_south_north + meansouth_north;
  SD_south_north + SDsouth_north;
  mean_east_west + meaneast_west;
  SD_east_west + SDeast_west;
  drop meansouth_north SDsouth_north meaneast_west SDeast_west;
run;
proc print data = locFinal;
var id South North mean_south_north SD_south_north east west mean_east_west SD_east_west;
run;
@jonatan_velarde wrote:
unfortunately I need SQL code, because I will use this code in a database containing more than 500 columns, so thanks for the effort, I'm sure someone will use it.
Please provide as much relevant information as you can upfront:
- What database please?
- What data volume (number of rows in the table)?
- 500 columns in a single table?
- Is there some group id in your source table? I don't expect that you want a single aggregated value for all rows.
- Please try to share representative data
- Please share this data via data step code that generates it so we can spend the time answering your question instead of preparing sample data.
If your database has 500 columns and you need to do this on all 500 (in pairs), SQL is probably the worst tool to choose.
PROC MEANS/PROC SUMMARY will produce the means or sums for all 500 variables, you can then in a subsequent data step combine the pairs of means (weighting properly) and pairs of standard deviations (weighting properly).
Also see In Database Processing for PROC MEANS
UNTESTED CODE
%let all_names = <your 500 variable names>;
proc means data=yourdatabasetable noprint;
var &all_names;
output out=stats sum= n= stddev=/autoname;
run;
proc sql noprint;
select name into :sums separated by ' ' from dictionary.columns where lowcase(name) ? '_sum'
and libname="WORK" and memname="STATS" order by varnum;
select name into :ns separated by ' ' from dictionary.columns where lowcase(name) ? '_n'
and libname="WORK" and memname="STATS" order by varnum;
select name into :stds separated by ' ' from dictionary.columns where lowcase(name) ? '_stddev'
and libname="WORK" and memname="STATS" order by varnum;
quit;
data final;
set stats;
array _s (*) &sums;
array _n (*) &ns;
array std (*) &stds;
array combined_mean (*) mean1-mean500;
array combined_std (*) std1-std500;
do i=1 to dim(_n) by 2;
combined_mean(i)=sum(_s(i),_s(i+1))/sum(_n(i),_n(i+1));
combined_std(i)= /* Formula is at https://www.statology.org/weighted-standard-deviation-excel/ */
/* I'm not going to program it now */ ;
end;
run;
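For what it's worth, here is a hedged sketch of that deferred last step for a single pair, using South and North as stand-ins for one of your variable pairs (the South_Sum / South_N / South_StdDev names are an assumption based on the /autoname convention above). It rebuilds the combined standard deviation from each variable's count, sum, and standard deviation:
data pooled;
  set stats;                                      /* one-row output from the PROC MEANS step above */
  n_all    = sum(South_N, North_N);               /* combined count of non-missing values          */
  mean_all = sum(South_Sum, North_Sum) / n_all;   /* combined mean = total sum / total count       */
  m_s = South_Sum / South_N;                      /* per-variable means recovered from the sums    */
  m_n = North_Sum / North_N;
  /* within-variable corrected sums of squares plus the shift for the differing means */
  css = (South_N - 1)*South_StdDev**2 + South_N*(m_s - mean_all)**2
      + (North_N - 1)*North_StdDev**2 + North_N*(m_n - mean_all)**2;
  sd_all = sqrt(css / (n_all - 1));               /* combined SD over all stacked values           */
run;
Run against the small example table, this should give mean_all = 3.5714286 and sd_all = 1.3972763 for the South/North pair.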
I found a clue:
proc sql;
create table want as
select *,
sum(cost + best) as sum1,
mean(cost best) as mean1,
std(cost + best) as sd1,
sum(west + east) as sum2,
mean(west + east) as mean2,
std(west + east) as sd2
from have;
quit;
but I can't get the correct mean.
Thanks in advance
@jonatan_velarde wrote:
I found a clue:
proc sql;
create table want as
select *,
sum(cost + best) as sum1,
mean(cost best) as mean1,
std(cost + best) as sd1,
sum(west + east) as sum2,
mean(west + east) as mean2,
std(west + east) as sd2
from have;
quit;
but I can't get the correct mean.
Thanks in advance
Your picture of the data doesn't show variables COST or BEST. So why are you including them here?
I do not understand what you want, please describe in much more detail.
On your first observation the value of South is 2 and of North is 4. The mean of 2 and 4 is exactly 3. So why do you have 3.57... as the mean on that observation?
If you want to treat North and South as the same variable then make them the same variable.
First let's convert your LISTING into an actual DATASET.
data have;
input id South North east west ;
cards;
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12
;
One way to make NORTH/SOUTH and EAST/WEST into the same variables is to just stack the two together.
proc sql;
select mean(north_south) as mean_north_south
, mean(east_west) as mean_east_west
from
(select north as north_south
, east as east_west
from have
union all
select south as north_south
, west as east_west
from have
);
quit;
Result:
mean_north_    mean_east_
      south          west
--------------------------
   3.571429        11.125
But if you really have 500 variables, that is going to be extremely painful to code in PROC SQL.
Why not just fix the data structure first. Then it will be easier.
proc transpose data=have out=tall;
by id;
var south -- west ;
run;
proc format ;
value $nsew
'North','north','South','south'='North-South'
'East','east','West','west'='East-West'
;
run;
proc means data=tall mean std;
class _name_;
format _name_ $nsew.;
var col1;
run;
Result
The MEANS Procedure

Analysis Variable : COL1

NAME OF FORMER     N
VARIABLE         Obs          Mean       Std Dev
------------------------------------------------
North-South        8     3.5714286     1.3972763
East-West          8    11.1250000     1.3562027
------------------------------------------------
So your terminology is confused, which is making it hard for you to explain your request.
DATASETS consist of OBSERVATIONS and VARIABLES. Your example dataset has only 4 observations and 5 variables. If you want to take the mean of all of the VALUES in both variables then that is what I showed you how to do.
If you want the count of non-missing values from PROC MEANS, then ask for it. It can also tell you the number of missing values.
proc means data=tall n nmiss mean std;
class _name_;
format _name_ $nsew.;
var col1;
run;
Result
The MEANS Procedure

Analysis Variable : COL1

NAME OF FORMER     N                N
VARIABLE         Obs     N       Miss          Mean       Std Dev
------------------------------------------------------------------
North-South        8     7          1     3.5714286     1.3972763
East-West          8     8          0    11.1250000     1.3562027
------------------------------------------------------------------
Hi Tom:
Thanks for your answer. Yeah, you are perfectly right, four observations for each variable:
data have:
id | South | North | east | west |
1 | 2 | 4 | 12 | 9 |
2 | 3 | . | 11 | 10 |
3 | 2 | 4 | 12 | 10 |
4 | 6 | 4 | 13 | 12 |
Using SQL I would like to obtain:
id | South | North | mean_south_north | SD_south_north | east | west | mean_east_west | SD_east_west |
1 | 2 | 4 | 3,57142857142857 | 1,39727626201154 | 12 | 9 | 11,125 | 1,35620268186054 |
2 | 3 | . | 3,57142857142857 | 1,39727626201154 | 11 | 10 | 11,125 | 1,35620268186054 |
3 | 2 | 4 | 3,57142857142857 | 1,39727626201154 | 12 | 10 | 11,125 | 1,35620268186054 |
4 | 6 | 4 | 3,57142857142857 | 1,39727626201154 | 13 | 12 | 11,125 | 1,35620268186054 |
where mean_south_north contains the arithmetic average, known as the mean, of every observation in the columns South and North, that is, of the values 2, 3, 2, 6, 4, 4, 4; by the same logic, SD_south_north contains the standard deviation of those same values, taken from the 4 observations of the variables South and North.
I already showed you how to do that. You have to put those two "columns" of North and South into one "column" so that you can then use the MEAN() aggregate function (commonly called AVG() in many SQL dialects). If your SQL dialect supports it, you could also use a STD() aggregate function (under whatever name your SQL dialect uses for it).
You could try crafting your own logic for generating statistics.
The mean is just the sum over the count. So that is simple.
(sum(south) + sum(north))/(count(south)+count(north)) as mean_south_north
But the formula for the standard deviation is much more complicated. You would need to remerge the mean back onto every observation, find the difference, etc. etc.
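For completeness, here is a rough sketch of that hard way in PROC SQL for the South/North pair of the HAVE table above: compute the pooled mean in an inline view, cross join it back onto every row, and then accumulate the squared deviations (SUM and COUNT simply skip the missing North value):
proc sql;
  select sqrt( ( sum((south - m)**2) + sum((north - m)**2) )
             / (count(south) + count(north) - 1) ) as sd_south_north
  from have,
       (select (sum(south) + sum(north)) / (count(south) + count(north)) as m
          from have);
quit;
On the sample data this returns 1.3972763, matching the PROC MEANS result shown above.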
Also why the heck do you want the same value repeated onto every observation.
Where are you going with this?
Also how did you end up with that input data structured where you have the values for a single variable split into two variables? It would be much easier in SQL if the data was structured with three variables (ID, DIRECTION, COUNT) and 16 observations instead of 5 variables and 4 observations.
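For the record, with that long structure the whole request collapses into one small grouped query. A sketch, assuming a long table called HAVE_LONG with variables ID, DIRECTION and VALUE (VALUE rather than COUNT, so the column name does not clash with the COUNT() function):
proc sql;
  select case when upcase(direction) in ('SOUTH','NORTH') then 'South-North'
              else 'East-West'
         end as pair,
         mean(value) as mean,
         std(value) as sd
  from have_long
  group by 1;   /* group by the first (calculated) column */
quit;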
Thank you for your correct answer, now I'll answer some of your questions:
Also why the heck do you want the same value repeated onto every observation.
Where are you going with this?
IT IS NOT YOUR BUSINESS!
Also how did you end up with that input data structured where you have the values for a single variable split into two variables?
YES, NOW I CAN WORK WITH MORE THAN 1000 VARIABLES
It would be much easier in SQL if the data was structured with three variables (ID, DIRECTION, COUNT) and 16 observations instead of 5 variables and 4 observations.
IT WOULD BE EASIER TO KEEP THE STRUCTURE OF THE QUESTION, AND MOVE ON!
🙂
data have;
input id South North east west ;
cards;
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12
;
proc sql;
create table want as
select *,
sum(south + north) as sum_south_north,   /* sum of the row totals; rows where either value is missing are skipped */
(sum(south) + sum(north))/(count(south)+count(north)) as mean_south_north,
std(south + north) as sd_south_north,    /* SD of the row totals, not of the stacked South/North values */
sum(west + east) as sum_west_east,
(sum(west) + sum(east))/(count(west)+count(east)) as mean_west_east,
std(west + east) as sd_west_east         /* SD of the row totals, not of the stacked east/west values */
from have;
quit;
proc print data = want;
run;
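If the exact output table from the original post is still the goal, here is a sketch of one way to get it in PROC SQL, along the lines of the UNION ALL approach shown earlier (not tuned for a 500-column table; the table and column names are the ones from HAVE above). Stack each pair of columns, aggregate once, and cross join the two one-row summaries back onto every row of HAVE:
proc sql;
  create table want2 as
  select h.id, h.South, h.North,
         sn.mean_south_north, sn.SD_south_north,
         h.east, h.west,
         ew.mean_east_west, ew.SD_east_west
  from have as h,
       (select mean(v) as mean_south_north, std(v) as SD_south_north
          from (select South as v from have
                union all
                select North as v from have)) as sn,
       (select mean(v) as mean_east_west, std(v) as SD_east_west
          from (select east as v from have
                union all
                select west as v from have)) as ew;
quit;
On the sample data this puts 3.5714286 / 1.3972763 and 11.125 / 1.3562027 on every row, matching the requested output.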