DATA Step, Macro, Functions and more

Proc means summary statistics, join etc

Reply
Frequent Contributor
Posts: 112

Proc means summary statistics, join etc

Ok, so here is the situation-

I have a dataset called Fish which has data on fish in 120 lakes, detailed in 120 rows and 15 columns including lake name. The scope of my project is to compare each lake to other similar lakes to see whether there are differences in eight numeric characteristics (mercury level, number of fish, elevation, surface area, depth, drainage area, runoff  factor, and flushing rate). Because not much is known about the distribution of the values we want to compare a lake's value to both the mean and median of all other lakes in the data set with the same value of lake type (LT) and dam status (DAM).

The first step is to use PROC MEANS to calculate the mean and median for each of the remaining numeric variables (all eight of
them) separately for each combination of LT and DAM. Lakes that don't have a value of LT or DAM reported need to
be included in this analysis. I need to use ODS to save the results to a data set named STATZ and ensure the results are not
printed to any other destination. Here is my code to do this task and I believe its correct, but just wanted to confirm from you guys.

Proc means data = fish ;
class lt/missing; class dam/missing;
var hg n elv sa z da rf fr ;
output out = statz1 mean = median = /autoname;
run;

The next step is to combine these summary statistics with the original data set using an appropriate data step merge/join technique. This new data set is FISH2.

 

Im confused here as Im not sure on what exactly I have to join summary dataset and original one on?

The final step, after the data sets are joined, but in the same DATA step , I need to do the following using arrays.
Compute the following for each 8 variables.
{ Difference from the mean [variable - mean.of.variable]
{ Percent difference from the mean [(variable - mean.of.variable)/mean.of.variable]
{ Difference from the median [variable - median.of.variable]
{ Percent difference from the median [(variable - median.of.variable)/median.of.variable]

 

Super User
Posts: 19,815

Re: Proc means summary statistics, join etc

[ Edited ]

Merging data

http://video.sas.com/detail/videos/sas-analytics-u/video/4572997800001/merging-sas-tables-in-a-data-...

 

Arrays:

https://stats.idre.ucla.edu/sas/seminars/sas-arrays/

 


devsas wrote:

Ok, so here is the situation-

I have a dataset called Fish which has data on fish in 120 lakes, detailed in 120 rows and 15 columns including lake name. The scope of my project is to compare each lake to other similar lakes to see whether there are differences in eight numeric characteristics (mercury level, number of fish, elevation, surface area, depth, drainage area, runoff  factor, and flushing rate). Because not much is known about the distribution of the values we want to compare a lake's value to both the mean and median of all other lakes in the data set with the same value of lake type (LT) and dam status (DAM).

The first step is to use PROC MEANS to calculate the mean and median for each of the remaining numeric variables (all eight of
them) separately for each combination of LT and DAM. Lakes that don't have a value of LT or DAM reported need to
be included in this analysis. I need to use ODS to save the results to a data set named STATZ and ensure the results are not
printed to any other destination. Here is my code to do this task and I believe its correct, but just wanted to confirm from you guys.

Proc means data = fish ;
class lt/missing; class dam/missing;
var hg n elv sa z da rf fr ;
output out = statz1 mean = median = /autoname;
run;

The next step is to combine these summary statistics with the original data set using an appropriate data step merge/join technique. This new data set is FISH2.

 

Im confused here as Im not sure on what exactly I have to join summary dataset and original one on?

The final step, after the data sets are joined, but in the same DATA step , I need to do the following using arrays.
Compute the following for each 8 variables.
{ Difference from the mean [variable - mean.of.variable]
{ Percent difference from the mean [(variable - mean.of.variable)/mean.of.variable]
{ Difference from the median [variable - median.of.variable]
{ Percent difference from the median [(variable - median.of.variable)/median.of.variable]

 


 

PROC Star
Posts: 1,760

Re: Proc means summary statistics, join etc

Like this?


proc summary data=FISH missing nway;
  class LT DAM;
  var HG N ELV SA Z DA RF FR ;
  output out=SUM mean= median= /autoname;
run;
proc sql;
create table FISH2 as select LAKENAME ,HG-HG_MEAN as DIFF_AVG_HG ,HG-HG_MEDIAN as DIFF_MED_HG ,etc from FISH left join SUM on HAVE.LT=SUM.LT and HAVE.DAM=SUM.DAM ; quit;

 

Frequent Contributor
Posts: 112

Re: Proc means summary statistics, join etc

Thanks so much! I see that proc summary gets pretty much the same results as my code using proc means do, except that _type_ = 3 is the only one in your results. Can you please explain the significance of that and how can I get the same result using proc means?

I believe I will do the merge using the data step and use arrays for repetitive statements.

Super User
Posts: 19,815

Re: Proc means summary statistics, join etc

Then you need to post what you're using/trying. 

 

I suspect you're missing the NWAY option in PROC MEANS. Note that MEANS/SUMMARY/UNIVARIATE are all very similar in how they work and in a situation like yours pretty much interchangeable.

Frequent Contributor
Posts: 112

Re: Proc means summary statistics, join etc

Yes, thanks you are right. I used the nway option and got the same results. Im still trying to understand though what it means. Doing some reading, but still not clear.

Super User
Posts: 19,815

Re: Proc means summary statistics, join etc

The documentation is pretty clear.

NWAY

limits the output statistics to the observations with the highest _TYPE_ value.

 

 

_TYPE_ is defined when you use CLASS statements. You can also control the levels reported using the WAYS and TYPES statements. 

I suggest searching PROC MEANS on the LexJansen website and see what it's capable of, its a really powerful procedure.

 

http://documentation.sas.com/?docsetId=proc&docsetVersion=9.4&docsetTarget=n1qnc9bddfvhzqn105kqitnf2...

Respected Advisor
Posts: 4,925

Re: Proc means summary statistics, join etc

Looks like you could do the whole thing with a single SQL query (I added a macro to save on typing)

 

%macro stats(v);
	&v. - mean(&v.) as &v._err,
	(&v. - mean(&v.)) / mean(&v.) as &v._err_pct format=percent7.1,
	&v. - median(&v.) as &v._diff,
	(&v. - median(&v.)) / median(&v.) as &v._diff_pct format=percent7.1
%mend stats;

proc sql;
create table fish2 as
select 
	*,
	%stats(hg),
    %stats(n),
    %stats(elv),
    %stats(sa),
    %stats(z),
    %stats(da),
    %stats(rf),
    %stats(fr)
from fish
group by lt, dam;
quit;
PG
Frequent Contributor
Posts: 112

Re: Proc means summary statistics, join etc

Thanks PGSTATS, your solution works too, its just that for this exercise I want to do it using proc means and arrays only to understand these better.

Frequent Contributor
Posts: 112

Re: Proc means summary statistics, join etc

Ok, I tried doing it with data step and arrays, but got stuck again. Here is my data step merge code, not sure how can I integrate array statement here to calculate mean, median differences and percent differences?

data fish2_ (keep = name hg n elv sa z da rf fr hg_mean n_mean elv_mean sa_mean z_mean da_mean rf_mean fr_mean hg_median n_median elv_median sa_median z_median da_median rf_median fr_median) ;
 merge fish (in = x) statz2 (in = y);
 by lt dam;
 if x = 1;
  run;
Frequent Contributor
Posts: 112

Re: Proc means summary statistics, join etc

Ok, so it seems I have nailed the first two steps using proc means and data step merge, but the last step using arrays im still not getting it. Here is my complete code-

Proc means data = fish missing nway noprint ;
class lt/missing; class dam/missing;
var hg n elv sa z da rf fr ;
output out = statz2 mean = median = /autoname;
run;

data fish2_ (keep = name hg n elv sa z da rf fr hg_mean n_mean elv_mean sa_mean z_mean da_mean rf_mean fr_mean hg_median n_median elv_median sa_median z_median da_median rf_median fr_median) ;
 merge fish (in = x) statz2 (in = y);
 by lt dam;
 if x = 1;
  run;

*using arrays to calculate mean difference only;
  data fish_f;
set fish2_;
array all{*} hg n elv sa z da rf fr;
array all_mean{*} hg_mean n_mean elv_mean sa_mean z_mean da_mean rf_mean fr_mean ;
do i=1 to dim(all);
diff[i]= all_mean{i} - all{i};
end;
run;

Please note that I only want to use data step and arrays to solve this problem

 

Super User
Posts: 19,815

Re: Proc means summary statistics, join etc

diff[i]

Where did you define a DIFF array. I'm sure your log has errors....read the log. ALWAYS. 

Frequent Contributor
Posts: 112

Re: Proc means summary statistics, join etc

Yes, that was a mistake. But I defined it now and still get errors-

data fish_f;
set fish2_;
array all{*} hg n elv sa z da rf fr;
array all_mean{*} hg_mean n_mean elv_mean sa_mean z_mean da_mean rf_mean fr_mean ;
array diff{*} diff_all_mean-diff_all;
do i=1 to dim(all);
diff[i]= all_mean{i} - all{i};
end;
run;
Frequent Contributor
Posts: 149

Re: Proc means summary statistics, join etc

Posting the log would be really helpful.
Frequent Contributor
Posts: 112

Re: Proc means summary statistics, join etc

Posted in reply to error_prone
319  array diff{*} diff_all_mean-diff_all;
ERROR: Missing numeric suffix on a numbered variable list (diff_all_mean-diff_all).
WARNING: Defining an array with zero elements.
320  do i=1 to dim(all);
Ask a Question
Discussion stats
  • 19 replies
  • 174 views
  • 4 likes
  • 5 in conversation