Re: Was the variable "RunsToDate" a new created variable?

jc3992 · Posted 03-10-2018 12:20 PM

Hello everyone,

This topic is from my assignment, and after checking the answer provided, I still have this question.

The data-set is as below:

The labels are:

Month Date Team Hits Runs Status

6-19 Columbia Peaches      8  3 Complete
6-20 Columbia Peaches     10  5 Complete
6-23 Plains Peanuts        3  4 Complete
6-24 Plains Peanuts        7  2 Complete
6-25 Plains Peanuts       12  8 Complete
6-30 Gilroy Garlics        .  . No Data
7-1  Gilroy Garlics        .  . No Data
7-4  Sacramento Tomatoes  15  9 Complete
7-4  Sacramento Tomatoes  10 10 Complete
7-5  Sacramento Tomatoes   2  3 Complete

The question was:

You want to accumulate the maximum number of runs that you know about. In this case, for example, record 6 should list MaxRuns=8.
You only want to accumulate the total number of runs to date until you don’t have information; when this happens you want to set RunsToDate to a missing value. In this case, for example, record 6 should list RunsToDate=.;

The code was as below:

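data mydata;
   infile "&dirdata/Week_5/Games_Plus.dat" truncover;
   input Month 1 Day 3-4 Team $6-24 Hits 27-28 Runs 30-31 Status $9.;
   retain MaxRuns RunsToDate 0;       /* both start at 0 and keep their values across observations */
   MaxRuns = max(MaxRuns, Runs);      /* max() ignores a missing Runs */
   RunsToDate = RunsToDate + Runs;    /* becomes missing, and stays missing, once Runs is missing */
run;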

proc print data=mydata;
title "Season's Record to Date, with Missing Values";
run;

My question is about the RETAIN statement:

At that point, I had not yet created variables named "MaxRuns" and "RunsToDate".

However, SAS seems to know about them.

I also did not understand MaxRuns: why is it written as MaxRuns=Max(MaxRuns,Runs) instead of simply MaxRuns=Max(Runs)?

And I think "RunsToDate=RunsToDate+Runs" works because RunsToDate of record 3 = RunsToDate of record 2 + Runs of record 3, and so on.

I guess I did not really understand this question.

I wonder if anyone understands this and would be willing to guide me a little.

Thanks a lot!:)

Kurt_Bremser · Posted 03-10-2018 03:21 PM

Contrary to SQL, where a summary function like max() can work over all rows, a data step (and a data step function) always deals with the current observation only. So you need to compare the current value with the retained summary value.
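
A minimal sketch of that difference, assuming a small inline data set instead of the assignment file (the names `runs_demo` and `OverallMax` are made up for illustration):

```sas
data runs_demo;
   input Runs;
   retain MaxRuns 0;               /* keeps its value from one observation to the next */
   /* max() only sees the current observation, so the retained
      running maximum is compared with the current Runs */
   MaxRuns = max(MaxRuns, Runs);   /* a missing Runs is ignored by max() */
   datalines;
3
5
.
4
8
;
run;

proc print data=runs_demo;
run;

/* For comparison: in SQL, the summary function works over all rows */
proc sql;
   select max(Runs) as OverallMax from runs_demo;
quit;
```

In the data step, MaxRuns should hold the largest value seen so far on each row, while the SQL query returns a single overall maximum.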

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

jc3992 · Posted 03-10-2018 09:17 PM

Thank you very much! Now I understand. Thanks~

Kurt_Bremser · Posted 03-11-2018 04:58 AM

The retained variables are not stored in a different section of memory; they are part of the PDV like all other non-automatic variables. The data step always does this with variables in the PDV:

- variables from input datasets are retained (so when one observation of dataset A is merged with several observations of B, the values of A persist)
- newly created variables are always set to missing at the start of a new data step iteration, unless they are named in a retain statement or a summation statement of the form
```
x + n;
```
(x will be retained, n is any numeric expression)
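
The rules above can be sketched with three new variables that differ only in how they are declared. This is an illustrative example, not part of the assignment; the names `reset_demo`, `kept`, `total` and `fresh` are made up:

```sas
data reset_demo;
   input n;
   retain kept 0;        /* named in a RETAIN statement: value persists */
   kept  = kept + n;     /* ordinary assignment: a missing n makes kept missing, and it stays missing */
   total + n;            /* sum statement: total is retained, starts at 0, and skips a missing n */
   fresh = fresh + n;    /* not retained: fresh is reset to missing at each iteration, so it is always missing */
   datalines;
1
2
.
4
;
run;

proc print data=reset_demo;
run;
```

The `kept` variable behaves like RunsToDate in the original question: once a missing value is added with an ordinary assignment, the retained total stays missing. The sum statement would instead keep accumulating past the gap.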

HTH

mkeintz · Posted 03-11-2018 12:41 PM

@Kurt_Bremser

Retained variables are indeed part of the PDV, but that does not mean they have the same memory address as when they are not retained. This is what I meant by "section" of memory.

For simplicity let me restrict my example to numeric variables that are newly created in the data step.

Consider the impact on the address of variable W below. W is retained in the second data step but not the first, and as a result has a different memory address. In fact, if you have a number of new variables, and retain a subset, I have never seen a retained variable in a memory location contiguous to the non-retained vars. Instead they are contiguous to each other (separated by 8 bytes needed for numeric variables). And the non-retained vars are similarly contiguous to each other.

This is why I believe it is a useful paradigm to consider the retain statement as a memory-location assignment statement.

Even so, as you have noted, they are in the PDV, and non-retained and retained variables can be logically contiguous, which is handy for programming logic statements such as array declarations.

data _null_;
   set sashelp.class;
   a=age;
   w=weight;
   h=height;

   ada=addrlong(a);
   adw=addrlong(w);
   adh=addrlong(h);
   put (ad:) (=$hex16. /);
   stop;
run;

data _null_;
   set sashelp.class;
   a=age;
   w=weight;
   h=height;
   retain w;
   ada=addrlong(a);
   adw=addrlong(w);
   adh=addrlong(h);
   put (ad:) (=$hex16. /);
   stop;
run;

My system is Windows, which has "little endian" addresses, so addresses such as

6840750600000000
7040750600000000

are "contiguous", i.e.

location 6840750600000000 is followed by
location 6940750600000000 is followed by
location 6A40750600000000 is followed by
location 6B40750600000000 is followed by
location 6C40750600000000 is followed by
location 6D40750600000000 is followed by
location 6E40750600000000 is followed by
location 6F40750600000000 is followed by
location 7040750600000000

providing 8 bytes for the numeric variable at the first address.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 03-10-2018 04:23 PM

The retain statement can precede the corresponding value assignment statement (although the retain statement has the option of assigning an initial value). Note that, because of this, you can declare a retained variable even though not only its value but also its type (numeric vs. character) is not evident until a subsequent statement.
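
As a rough illustration, the sketch below names a variable in a retain statement before anything tells SAS its type. The variable name `first_name` is made up for this example, and `sashelp.class` is used only as a convenient input:

```sas
data _null_;
   retain first_name;                    /* no initial value: type and length not yet known */
   set sashelp.class;
   if _n_ = 1 then first_name = name;    /* this assignment makes first_name character */
   put _n_= name= first_name=;           /* first_name keeps the value from observation 1 */
   if _n_ = 3 then stop;
run;
```

If the retain statement had supplied an initial value, for example `retain first_name 'none';`, that value would already fix the type at that point.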

jc3992 · Posted 03-10-2018 09:16 PM

Thank you. A bit complicated, but I think I will figure it out once I am more familiar with it. Thanks!

mkeintz · Posted 03-10-2018 10:02 PM

Think of it this way. When the SAS data step encounters a retain statement it turns out that the retained variables are stored in a different part of memory than for non-retained variables. This can be demonstrated by use of the ADDRLONG function, which I don't propose to describe here.

So in a way, all the retain statement apparently does is assign a variable's location in memory to a region which the data step does not reset to missing with each new record.
