BookmarkSubscribeRSS Feed
devsas
Pyrite | Level 9

Ok, so here is the situation-

I have a dataset called Fish which has data on fish in 120 lakes, detailed in 120 rows and 15 columns including lake name. The scope of my project is to compare each lake to other similar lakes to see whether there are differences in eight numeric characteristics (mercury level, number of fish, elevation, surface area, depth, drainage area, runoff  factor, and flushing rate). Because not much is known about the distribution of the values we want to compare a lake's value to both the mean and median of all other lakes in the data set with the same value of lake type (LT) and dam status (DAM).

The first step is to use PROC MEANS to calculate the mean and median for each of the remaining numeric variables (all eight of
them) separately for each combination of LT and DAM. Lakes that don't have a value of LT or DAM reported need to
be included in this analysis. I need to use ODS to save the results to a data set named STATZ and ensure the results are not
printed to any other destination. Here is my code to do this task and I believe its correct, but just wanted to confirm from you guys.

Proc means data = fish ;
class lt/missing; class dam/missing;
var hg n elv sa z da rf fr ;
output out = statz1 mean = median = /autoname;
run;

The next step is to combine these summary statistics with the original data set using an appropriate data step merge/join technique. This new data set is FISH2.

 

Im confused here as Im not sure on what exactly I have to join summary dataset and original one on?

The final step, after the data sets are joined, but in the same DATA step , I need to do the following using arrays.
Compute the following for each 8 variables.
{ Difference from the mean [variable - mean.of.variable]
{ Percent difference from the mean [(variable - mean.of.variable)/mean.of.variable]
{ Difference from the median [variable - median.of.variable]
{ Percent difference from the median [(variable - median.of.variable)/median.of.variable]

 

19 REPLIES 19
Reeza
Super User

Merging data

http://video.sas.com/detail/videos/sas-analytics-u/video/4572997800001/merging-sas-tables-in-a-data-...

 

Arrays:

https://stats.idre.ucla.edu/sas/seminars/sas-arrays/

 


@devsas wrote:

Ok, so here is the situation-

I have a dataset called Fish which has data on fish in 120 lakes, detailed in 120 rows and 15 columns including lake name. The scope of my project is to compare each lake to other similar lakes to see whether there are differences in eight numeric characteristics (mercury level, number of fish, elevation, surface area, depth, drainage area, runoff  factor, and flushing rate). Because not much is known about the distribution of the values we want to compare a lake's value to both the mean and median of all other lakes in the data set with the same value of lake type (LT) and dam status (DAM).

The first step is to use PROC MEANS to calculate the mean and median for each of the remaining numeric variables (all eight of
them) separately for each combination of LT and DAM. Lakes that don't have a value of LT or DAM reported need to
be included in this analysis. I need to use ODS to save the results to a data set named STATZ and ensure the results are not
printed to any other destination. Here is my code to do this task and I believe its correct, but just wanted to confirm from you guys.

Proc means data = fish ;
class lt/missing; class dam/missing;
var hg n elv sa z da rf fr ;
output out = statz1 mean = median = /autoname;
run;

The next step is to combine these summary statistics with the original data set using an appropriate data step merge/join technique. This new data set is FISH2.

 

Im confused here as Im not sure on what exactly I have to join summary dataset and original one on?

The final step, after the data sets are joined, but in the same DATA step , I need to do the following using arrays.
Compute the following for each 8 variables.
{ Difference from the mean [variable - mean.of.variable]
{ Percent difference from the mean [(variable - mean.of.variable)/mean.of.variable]
{ Difference from the median [variable - median.of.variable]
{ Percent difference from the median [(variable - median.of.variable)/median.of.variable]

 


 

ChrisNZ
Tourmaline | Level 20

Like this?


proc summary data=FISH missing nway;
  class LT DAM;
  var HG N ELV SA Z DA RF FR ;
  output out=SUM mean= median= /autoname;
run;
proc sql;
create table FISH2 as select LAKENAME ,HG-HG_MEAN as DIFF_AVG_HG ,HG-HG_MEDIAN as DIFF_MED_HG ,etc from FISH left join SUM on HAVE.LT=SUM.LT and HAVE.DAM=SUM.DAM ; quit;

 

devsas
Pyrite | Level 9

Thanks so much! I see that proc summary gets pretty much the same results as my code using proc means do, except that _type_ = 3 is the only one in your results. Can you please explain the significance of that and how can I get the same result using proc means?

I believe I will do the merge using the data step and use arrays for repetitive statements.

Reeza
Super User

Then you need to post what you're using/trying. 

 

I suspect you're missing the NWAY option in PROC MEANS. Note that MEANS/SUMMARY/UNIVARIATE are all very similar in how they work and in a situation like yours pretty much interchangeable.

devsas
Pyrite | Level 9

Yes, thanks you are right. I used the nway option and got the same results. Im still trying to understand though what it means. Doing some reading, but still not clear.

Reeza
Super User

The documentation is pretty clear.

NWAY

limits the output statistics to the observations with the highest _TYPE_ value.

 

 

_TYPE_ is defined when you use CLASS statements. You can also control the levels reported using the WAYS and TYPES statements. 

I suggest searching PROC MEANS on the LexJansen website and see what it's capable of, its a really powerful procedure.

 

http://documentation.sas.com/?docsetId=proc&docsetVersion=9.4&docsetTarget=n1qnc9bddfvhzqn105kqitnf2...

PGStats
Opal | Level 21

Looks like you could do the whole thing with a single SQL query (I added a macro to save on typing)

 

%macro stats(v);
	&v. - mean(&v.) as &v._err,
	(&v. - mean(&v.)) / mean(&v.) as &v._err_pct format=percent7.1,
	&v. - median(&v.) as &v._diff,
	(&v. - median(&v.)) / median(&v.) as &v._diff_pct format=percent7.1
%mend stats;

proc sql;
create table fish2 as
select 
	*,
	%stats(hg),
    %stats(n),
    %stats(elv),
    %stats(sa),
    %stats(z),
    %stats(da),
    %stats(rf),
    %stats(fr)
from fish
group by lt, dam;
quit;
PG
devsas
Pyrite | Level 9

Thanks PGSTATS, your solution works too, its just that for this exercise I want to do it using proc means and arrays only to understand these better.

devsas
Pyrite | Level 9

Ok, I tried doing it with data step and arrays, but got stuck again. Here is my data step merge code, not sure how can I integrate array statement here to calculate mean, median differences and percent differences?

data fish2_ (keep = name hg n elv sa z da rf fr hg_mean n_mean elv_mean sa_mean z_mean da_mean rf_mean fr_mean hg_median n_median elv_median sa_median z_median da_median rf_median fr_median) ;
 merge fish (in = x) statz2 (in = y);
 by lt dam;
 if x = 1;
  run;
devsas
Pyrite | Level 9

Ok, so it seems I have nailed the first two steps using proc means and data step merge, but the last step using arrays im still not getting it. Here is my complete code-

Proc means data = fish missing nway noprint ;
class lt/missing; class dam/missing;
var hg n elv sa z da rf fr ;
output out = statz2 mean = median = /autoname;
run;

data fish2_ (keep = name hg n elv sa z da rf fr hg_mean n_mean elv_mean sa_mean z_mean da_mean rf_mean fr_mean hg_median n_median elv_median sa_median z_median da_median rf_median fr_median) ;
 merge fish (in = x) statz2 (in = y);
 by lt dam;
 if x = 1;
  run;

*using arrays to calculate mean difference only;
  data fish_f;
set fish2_;
array all{*} hg n elv sa z da rf fr;
array all_mean{*} hg_mean n_mean elv_mean sa_mean z_mean da_mean rf_mean fr_mean ;
do i=1 to dim(all);
diff[i]= all_mean{i} - all{i};
end;
run;

Please note that I only want to use data step and arrays to solve this problem

 

Reeza
Super User
diff[i]

Where did you define a DIFF array. I'm sure your log has errors....read the log. ALWAYS. 

devsas
Pyrite | Level 9

Yes, that was a mistake. But I defined it now and still get errors-

data fish_f;
set fish2_;
array all{*} hg n elv sa z da rf fr;
array all_mean{*} hg_mean n_mean elv_mean sa_mean z_mean da_mean rf_mean fr_mean ;
array diff{*} diff_all_mean-diff_all;
do i=1 to dim(all);
diff[i]= all_mean{i} - all{i};
end;
run;
error_prone
Barite | Level 11
Posting the log would be really helpful.
devsas
Pyrite | Level 9
319  array diff{*} diff_all_mean-diff_all;
ERROR: Missing numeric suffix on a numbered variable list (diff_all_mean-diff_all).
WARNING: Defining an array with zero elements.
320  do i=1 to dim(all);

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 19 replies
  • 5158 views
  • 4 likes
  • 5 in conversation