BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
jonatan_velarde
Lapis Lazuli | Level 10

i have this data corresponding to many participants, and were collected data from them in different locations.

 

id South North east west
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12

 

now i need to obtain the mean ( or average) of all of them and put it into columns, as well as standard deviation.

 

Repair that both mean and standard deviation are considering all observations, in sql i don't want to divide the sum by the number of the observations, is not fashion LOL.

 

i would like to obtain this:

 

id South North mean_south_north SD_south_north east west mean_east_west SD_east_west
1 2 4 3,57142857142857 1,39727626201154 12 9 11,125 1,35620268186054
2 3 . 3,57142857142857 1,39727626201154 11 10 11,125 1,35620268186054
3 2 4 3,57142857142857 1,39727626201154 12 10 11,125 1,35620268186054
4 6 4 3,57142857142857 1,39727626201154 13 12 11,125 1,35620268186054

 

the first mean comes from the observations obtained in south and north, and the second, on east and west. as well their respective standard deviations.

Thanks in advance

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

I already showed you how to do that.  You have to put those two "columns" of North and South into one "column" so that you can then use the MEAN() (or as it is commonly called in many SQL dialects AVERAGE()) aggregate function.  If your SQL dialect supports it you could also use a STD() aggregate function (by whatever name your SQL dialect chooses to use for it).

 

You could try crafting your own logic for generating statistics.

The mean is just the sum over the count. So that is simple.

(sum(south) + sum(north))/(count(south)+count(north)) as mean_south_north

But the formula for the standard deviation is much more complicated.  You would need to remerge the mean back onto every observation, find the difference, etc. etc.

 

Also why the heck do you want the same value repeated onto every observation.

Where are you going with this?

Also how did you end up with that input data structured where you have the values for a single variable split into two variables?  It would be much easier in SQL if the data was structured with three variables (ID, DIRECTION, COUNT) and 16 observations instead of 5 variables and 4 observations.  

 

 

 

View solution in original post

21 REPLIES 21
dpalmer1
Fluorite | Level 6

Not a SQL solution but using PROC TRANSPOSE and PROC MEANS.

data loc;
	input id South North east west;
	datalines;
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12
;

proc transpose data = loc out = locLong(rename = (_NAME_ = Location COL1 = Response));
	by id;
run;

proc means data = locLong noprint;
	var Response;
	where Location in ("South", "North");
	output  out = southNorth(drop = _TYPE_ _FREQ_) mean = stddev = / autoname;
run;

proc means data = locLong noprint;
	var Response;
	where Location in ("east", "west");
	output  out = eastWest(drop = _TYPE_ _FREQ_) mean = stddev = / autoname;
run;

data locFinal;
	retain mean_south_north SD_south_north;
	merge loc southNorth(rename = (Response_Mean = meansouth_north Response_StdDev = SDsouth_north)) eastWest(rename = (Response_Mean = meaneast_west Response_StdDev = SDeast_west));
	mean_south_north + meansouth_north;
	SD_south_north + SDsouth_north;
	mean_east_west + meaneast_west;
	SD_east_west + SDeast_west;
	drop meansouth_north SDsouth_north meaneast_west SDeast_west;
run;

proc print data = locFinal;
	var id South North mean_south_north SD_south_north east west mean_east_west SD_east_west;
run;
jonatan_velarde
Lapis Lazuli | Level 10
unfortunately i need sql code, because i will use this code in a database containing more that 500 columns, so thanks for the effort, I'm sure someone will use it..
Patrick
Opal | Level 21

@jonatan_velarde wrote:
unfortunately i need sql code, because i will use this code in a database containing more that 500 columns, so thanks for the effort, I'm sure someone will use it..

Please provide as much as you can all relevant information upfront.

- What database please?

- What data volume (number of rows in the table)?

- 500 columns in a single table? 

- Is there some group id in your source table? I don't expect that you want a single aggregated value for all rows.

   - Please try to share representative data

   - Please share this data via data step code that generates it so we can spend the time answering your question instead of preparing sample data.

 

jonatan_velarde
Lapis Lazuli | Level 10
Hi there!!!
Only need sql because is resumed and efficient.

The database i posted as example will let me do the job for the other. that's why i showed you this here.

Thank you
jonatan_velarde
Lapis Lazuli | Level 10
The example above contains all the information needed to be replied in a big data base. Thank you for your comprehension.
PaigeMiller
Diamond | Level 26

If your database has 500 columns and you need to do this on all 500 (in pairs), SQL is probably the worst tool to choose.

 

PROC MEANS/PROC SUMMARY will produce the means or sums for all 500 variables, you can then in a subsequent data step combine the pairs of means (weighting properly) and pairs of standard deviations (weighting properly).

 

Also see In Database Processing  for PROC MEANS 

 

UNTESTED CODE

 

%let all_names = <your 500 variable names>;

proc means data=yourdatabasetable noprint;
    var &all_names;
    output out=stats sum= n= stddev=/autoname;
run;

proc sql noprint;
    select cats(name,'_sum') into :sums separated by ' ' from dictionary.columns where lowcase(name) ? '_sum'
        and libname="WORK" and memname="STATS" order by varnum;
    select cats(name,'n') into :ns separated by ' ' from dictionary.columns where lowcase(name) ? '_n' 
        and libname="WORK" and memname="STATS" order by varnum;
    select cats(name,'stddev') into :stds separated by ' ' from dictionary.columns where lowcase(name) ? '_stddev' 
        and libname="WORK" and memname="STATS" order by varnum;
quit;

data final;
    set stats;
    array _s &sums;
    array _n &ns;
    array std &stds;
    array combined_mean mean1-mean500;
    array combined_std std1-std500;
    do i=1 to dim(_n) by 2;
        combined_mean(i)=sum(_s(i),s(i+1))/sum(_n(i),_n(i+1));
        combined_std(i)= /* Formula is at https://www.statology.org/weighted-standard-deviation-excel/ */
            /* I'm not going to program it now */ ;
    end;
run;

 

--
Paige Miller
jonatan_velarde
Lapis Lazuli | Level 10

i found A  clue:

 

proc sql;
create table want as
select *,
sum(cost + best) as sum1,
mean(cost best) as mean1,
std(cost + best) as sd1,
sum(west + east) as sum2,
mean(west + east) as mean2,
std(west + east) as sd2
from have;
quit;

 

but i cant get the correct mean.

 

thanks in advance

ballardw
Super User

@jonatan_velarde wrote:

i found A  clue:

 

proc sql;
create table want as
select *,
sum(cost + best) as sum1,
mean(cost best) as mean1,
std(cost + best) as sd1,
sum(west + east) as sum2,
mean(west + east) as mean2,
std(west + east) as sd2
from have;
quit;

 

but i cant get the correct mean.

 

thanks in advance


Your picture of the data doesn't show variables COST or BEST. So why are you including them here?

Tom
Super User Tom
Super User

I do not understand what you want, please describe in much more detail.

On your first observation the value of South is 2 and of North is 4.  The mean of 2 and 4 is exactly 3.  So why do you have  3.57... as the mean on that observation?

 

If you want to treat North and South as the same variable then make them the same variable.

 

First let's convert your LISTING into an actual DATASET.

data have;
  input id South North east west ;
cards;
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12
;

One way to make NORTH/SOUTH and EAST/WEST into the same variables is to just stack the two together.

proc sql;
select mean(north_south) as mean_north_south
     , mean(east_west) as mean_east_west
from 
(select north as north_south
      , east as east_west 
 from have
 union all 
 select south as north_south
      , west as east_west
 from have
);
quit;

Result:

 mean_north_    mean_east_
       south          west
--------------------------
    3.571429        11.125

But if you really have 500 variables that is going to extremely painful to code in PROC SQL.

Why not just fix the data structure first.  Then it will be easier.


proc transpose data=have out=tall;
  by id;
  var south -- west ;
run;

proc format ;
 value $nsew  
   'North','north','South','south'='North-South'
   'East','east','West','west'='East-West'
  ;
run;

proc means data=tall mean std;
  class _name_;
  format _name_ $nsew.;
  var col1;
run;

Result

The MEANS Procedure

            Analysis Variable : COL1

NAME OF
FORMER           N
VARIABLE       Obs            Mean         Std Dev
--------------------------------------------------
North-South      8       3.5714286       1.3972763

East-West        8      11.1250000       1.3562027
--------------------------------------------------
jonatan_velarde
Lapis Lazuli | Level 10
The MEANS Procedure

Analysis Variable : COL1

NAME OF
FORMER N
VARIABLE Obs Mean Std Dev
--------------------------------------------------
North-South "7" 3.5714286 1.3972763

East-West 8 11.1250000 1.3562027
--------------------------------------------------

There are only seven information for the first mean.

It is needed to find the whole mean, from all the observations contained in both columns
Tom
Super User Tom
Super User

So your terminology is confused, which is making it hard for you to explain your request.

 

DATASETS consist of OBSERVATIONS and VARIABLES.  Your example dataset has only 4 observations and 5 variables.  If you want to take the mean of all of the VALUES in both variables then that is what I showed you how to do.

 

If you want the count of the number of non missing values from PROC MEANS then ask for it.  It can also tell you the number of missing values.

proc means data=tall n nmiss mean std;
  class _name_;
  format _name_ $nsew.;
  var col1;
run;

Result

The MEANS Procedure

                   Analysis Variable : COL1

NAME OF
FORMER           N             N
VARIABLE       Obs     N    Miss            Mean         Std Dev
----------------------------------------------------------------
North-South      8     7       1       3.5714286       1.3972763

East-West        8     8       0      11.1250000       1.3562027
----------------------------------------------------------------

 

 

jonatan_velarde
Lapis Lazuli | Level 10

Hi Tom:

 

Thanks for your answer. Yeah, you are perfectly right, four observations for each variable:

data have:

id South North east west
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12

 

using sql i would like to obtain:

 

id South North mean_south_north SD_south_north east west mean_east_west SD_east_west
1 2 4 3,57142857142857 1,39727626201154 12 9 11,125 1,35620268186054
2 3 . 3,57142857142857 1,39727626201154 11 10 11,125 1,35620268186054
3 2 4 3,57142857142857 1,39727626201154 12 10 11,125 1,35620268186054
4 6 4 3,57142857142857 1,39727626201154 13 12 11,125 1,35620268186054

 

where mean_south_north contains the arithmetic average, known as mean, from every observation in the columns South and North, in BLUE BOLD, as follows: 2, 3, 2, 6, 4, 4, 4; using the same thinking, SD_south_north contains the standard deviation from BLUE BOLD data, from the 4 observations and variables South and North.

 

 

 

 

Tom
Super User Tom
Super User

I already showed you how to do that.  You have to put those two "columns" of North and South into one "column" so that you can then use the MEAN() (or as it is commonly called in many SQL dialects AVERAGE()) aggregate function.  If your SQL dialect supports it you could also use a STD() aggregate function (by whatever name your SQL dialect chooses to use for it).

 

You could try crafting your own logic for generating statistics.

The mean is just the sum over the count. So that is simple.

(sum(south) + sum(north))/(count(south)+count(north)) as mean_south_north

But the formula for the standard deviation is much more complicated.  You would need to remerge the mean back onto every observation, find the difference, etc. etc.

 

Also why the heck do you want the same value repeated onto every observation.

Where are you going with this?

Also how did you end up with that input data structured where you have the values for a single variable split into two variables?  It would be much easier in SQL if the data was structured with three variables (ID, DIRECTION, COUNT) and 16 observations instead of 5 variables and 4 observations.  

 

 

 

jonatan_velarde
Lapis Lazuli | Level 10

Thank you for your correct answer, now ill answer you some questions:

 

Also why the heck do you want the same value repeated onto every observation.

Where are you going with this?

 

IT IS NOT YOUR BUSINESS!,

 

Also how did you end up with that input data structured where you have the values for a single variable split into two variables?

 

YES, NOW I CAN WORK WITH MORE THAN 1000 VARIABLES

 

It would be much easier in SQL if the data was structured with three variables (ID, DIRECTION, COUNT) and 16 observations instead of 5 variables and 4 observations.

 

IT WOULD BE EASIER TO KEEP THE STRUCTURE OF QUE QUESTION, AND MOVE ON!

 

🙂

 

 

 

 

data have;
input id South North east west ;
cards;
1 2 4 12 9
2 3 . 11 10
3 2 4 12 10
4 6 4 13 12
;

proc sql;
create table want as
select *,
sum(south + north) as sum_south_north,
(sum(south) + sum(north))/(count(south)+count(north)) as mean_south_north,
std(south + north) as sd_south_north,
sum(west + east) as sum_west_east,
(sum(west) + sum(east))/(count(west)+count(east)) as mean_west_east,
std(west + east) as sd_south_north
from have;
quit;

proc print data = want;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 21 replies
  • 692 views
  • 4 likes
  • 7 in conversation