Help using Base SAS procedures

A Proc Sort Mystery. Please help!

Reply
Contributor
Posts: 72

A Proc Sort Mystery. Please help!

Hi all,

I have a bunch of panel data/longitudinal observations as follows

fund_id firm_id report_date
A 4545 30/6/1998
A 6688 30/12/1998
B 4545 30/3/2000
B 4545 30/9/2000

B 9535 30/3/2001
C 4656 ....
D 7586
D 8965

As you can see, the data are reported sometimes twice per year (as I highlighted bold above). I use Proc Sort as below

proc sort data=data;
by fund_id firm_id report_date;data keep_latest_report_date_only;
set data;
by fund_id firm_id report_date;
if last.report_date;run;

BTW, in the dataset report_date is already a date format (not numeric or anything)

However, the code above doesn't work. It doesn't remove the earlier report dates. BUT when I use this code just below, it works perfectly.

data data;
set data;
year=year(rep_date);run;
proc sort data=data;
by fund_id firm_id year;
data keep_latest_report_date_only;
by fund_id firm_id year;
if last.year;run;

My question: Why is this so? Why YEAR function works while not using it doesn't help remove the older report dates?

Thank you very much

Message was edited by: smilingmelbourne Message was edited by: smilingmelbourne
Super User
Posts: 17,819

Re: A Proc Sort Mystery. Please help!

Your BY processing by 3 variables, Fund ID, Firm ID and Year.

For the records indicated the Firm ID's are different so they get put into different groups and each will be the last.

Check the documentation on BY processing.

Check the proc sort with nodupkey option and sort descending options.

HTH!
Contributor
Posts: 72

Re: A Proc Sort Mystery. Please help!

Sorry that the example data I put in didn't convey my idea. What I really want to figoure out is that why for the same firm that has more than 2 report dates, proc sort with rep_date doesn't help in removing older report dates. In the original post, I didn't make clear my question when I typed in different firms and this of course would be treated as a different group.

In my data sets, it looks sth like:

Fund_ID Firm_ID Report_Date
A 123 oldest date
A 123 older date
A 123 latest date
B
B
B

So 123 treated as a group, and if using last.report_date after proc sort by fund_id firm_id report_date, IT SHOULD keep only "latest date", but it doesn't until I use Year function for the report_date, i.e. year=year(report_date), and then replace report_date by year in the proc sort...

Thanks
Super Contributor
Posts: 356

Re: A Proc Sort Mystery. Please help!

The reason it works for year is that if you have two dates with the same year you then have two objects for the year so it takes the last.... if you have unique dates each date will be first and last so selects all of them...


fund_id firm_id report_date YEAR
A 4545 30/6/1998 1998
A 6688 30/12/1998 1998

B 4545 30/3/2000 2000 (First and Last Date)(First Year)
B 4545 30/9/2000 2000 (First and Last Date)(Last Year)

B 9535 30/3/2001 2001

C 4656 ....
D 7586
D 8965






what you want to do is have

Last.Firm_ID rather Last.Report_Date

which will give you the latest date for each

Clear as Mud

Barry

Message was edited by: twocanbazza

Message was edited by: twocanbazza Message was edited by: twocanbazza
Contributor
Posts: 72

Re: A Proc Sort Mystery. Please help!

If I use LAST.FIRM_ID, then my panel data/longitudinal data structure will be gone! :-). I must use LAST.STH, where STH is for REPORT_DATE. I've used YEAR() for REPORT_DATE and it work just perfect. My query is why it cannot sort by REPORT_DATE whereby latest date is the last in a group and oldest the first.

From this::

fund_id firm_id report_date
A B 30/5/2000
A B 31/12/2000
A B 30/3/2001
A B 30/6/2001
A B 31/12/2001

I want this::
fund_id firm_id report_date
A B 31/12/2000
A B 31/12/2001

Of course, I used YEAR() for report_date and it was all OK, but I am wondering why I can't use this code below for the same results

proc sort data=sth;
by fund_id firm_id report_date;
data=sth_latest_date; /*for each fund, keep only the latest obs for each of the firms that the fund invests in*/
if LAST.REPORT_DATE;
run;
Valued Guide
Posts: 632

Re: A Proc Sort Mystery. Please help!

In your final post you have mixed steps. SAS processes one step at a time and the steps must have clear boundaries (steps are never nested or imbedded, but are always distinct from each other). The SET statement is used within a DATA step not a PROC step.

DATA steps start with a DATA statement
[pre] data new;[/pre]

PROC steps start with a PROC statement
[pre] proc sort data=new out=newer;[/pre]

Although not strictly necessary, good programming practice uses the RUN: or QUIT; statement to terminate a step.
Super Contributor
Super Contributor
Posts: 365

Re: A Proc Sort Mystery. Please help!

Hello SmilingMelbourne,

If I rearrange your code
[pre]
proc sort data=sth;
by fund_id firm_id report_date;
data=sth_latest_date; /*for each fund, keep only the latest obs for each of the firms that the fund invests in*/
if LAST.REPORT_DATE;
run;
[/pre]
in more readable one:
[pre]
proc sort data=sth;
by fund_id firm_id report_date;
run;
data=sth_latest_date;
set sth;
/*for each fund, keep only the latest obs for each of the firms that the fund invests in*/
if LAST.REPORT_DATE;
run;
[/pre]
then it obvious that the datastep does not contain necessary BY FUND_ID FIRM_ID and instead of your IF it should be IF LAST.FIRM_ID. However, this is not what it seems to me you want. You need last date within each YEAR but this code simply gives the most recent date and eliminates the necessary observation A B 31/12/2000.

Sincerely,
SPR
Contributor
Posts: 72

Re: A Proc Sort Mystery. Please help!

Thank you all for your help. I think I should post in here again a small subset of the actual data set, the logs of the codes that I wrote that didn't work and that work.

Original Data Set
owner_id security_id shr_held value_held rep_date

2000004 29191 0 0 31/03/2004
2000004 113563 666642 2266583 31/12/2004
2000004 113563 666642 2119922 31/03/2005
2000004 113563 0 0 31/05/2005
2000004 113571 80801 1260382 30/09/2005
2000004 113571 209629 3136985 31/12/2005
2000004 113571 87283 1443286 31/03/2006
2000004 113571 91624 1555180 30/06/2006
2000004 113571 89039 1493031 30/09/2006
2000004 113571 0 0 31/12/2006

Codes that did NOT work and their logs
738 proc sort data=mf;
739 by owner_id security_id rep_date;

NOTE: There were 25711268 observations read from the data set WORK.MF.
NOTE: The data set WORK.MF has 25711268 observations and 6 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 3:14.04
cpu time 27.95 seconds

740 data nodup;
741 set mf;
742 by owner_id security_id rep_date;
743 if last.rep_date;
744 run;

NOTE: There were 25711268 observations read from the data set WORK.MF.
NOTE: The data set WORK.NODUP has 25711268 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 24.79 seconds
cpu time 19.18 seconds

Codes that WORKED and their logs
745 data mf;
746 set mf;
747 year=year(rep_date);
748 run;

749 proc sort data=mf;
750 by owner_id security_id year;

NOTE: There were 25711268 observations read from the data set WORK.MF.
NOTE: The data set WORK.MF has 25711268 observations and 6 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 2:59.32
cpu time 28.57 seconds


751 data nodup;
752 set mf;
753 by owner_id security_id year;
754 if last.year;
755 run;

NOTE: There were 25711268 observations read from the data set WORK.MF.
NOTE: The data set WORK.NODUP has 11870591 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 24.23 seconds
cpu time 12.23 seconds

As you can see, the ONLY thing that I changed that make the codes work is throw away rep_date and replace it with year, where year is extracted using YEAR().

I still don't understand why:
- if year=year(rep_date) works, i.e. it can tell the difference between e.g. 30/3/2000 and 30/5/2000, then it means that SAS knows that these 2 dates have are different in months (march vs may) and it can sort in descending order or so

- if SAS can distinguish between march vs may in 30/3/2000 and 30/5/2000, why the code using rep_date in last.rep_date not work? For the same fund id 2000004 and the same firm id 113563, we have different dates (different in months), so logically SAS can sort from oldest dates on top and latest dates on bottom for each of the securities and for each of the firms.

- Why do I NOT need to use the same code as Cynthia so as to make it work, that is, I didn't use year as a group before rep_date and it still worked.

I use this: and it still worked
data nodup;
set mf;
by owner_id security_id year;
if last.year;
run;

Instead of this as Cynthia recommends:
data nodup;
set mf;
by owner_id security_id year rep_date;if last.rep_date;
run;
Super Contributor
Super Contributor
Posts: 3,174

Re: A Proc Sort Mystery. Please help!

The OP needs to re-read the entire thread, especially Cynthia's explanation that makes use of the PUTLOG _ALL_; processing to demonstrate how BY GROUP processing is behaving. With IF LAST.rep_date specified, SAS is acknowledging each unique date occurrence and is going to output the "last" one for each unique date value --- not a date within a particular 'year-portion of any given date'. That is where the code is flawed -- that is if the OP wants to only capture the last occurrence within a given calendar year.

You will benefit from adding diagnostic PUTLOG commands to your SAS program and running it yourself with your data -- and suggest running it multiple times with different BY variable definitions -- the most important BY variable being the one you use in your IF LAST. specification for output.

Scott Barry
SBBWorks, Inc.

Suggested Google advanced search argument, this topic / post:
by group processing site:sas.com
SAS Super FREQ
Posts: 8,743

Re: A Proc Sort Mystery. Please help!

Hi:
I left REPORT_DATE in my code only for teaching purposes. I wanted to show you that LAST.REPORT_DATE was inappropriate compared to LAST.R_YEAR -- in order to illustrate that point, REPORT_DATE had to be in my BY statement.

My code was not proposed as the "solution" to your issue. My code was meant to show you that LAST.REPORT_DATE was inappropriate as a condition to get you the desired rows. If I had been solving your problem, I would have written a program that used only LAST.R_YEAR without LAST.REPORT_DATE. I would NOT have captured every single FIRST. and LAST. variable.

It seems to me that your fundamental problem lies with this statement/belief where you say:

- if SAS can distinguish between march vs may in 30/3/2000 and 30/5/2000, why the code using rep_date in last.rep_date not work? For the same fund id 2000004 and the same firm id 113563, we have different dates (different in months), so logically SAS can sort from oldest dates on top and latest dates on bottom for each of the securities and for each of the firms.



OK, let's go back to fundamentals. SAS dates. What are they. SAS handles dates as an offset from January 1, 1960. What does that mean? January 1, 1960 is internally stored as 0. 15Nov1950 was 3334 days BEFORE January 1, 1960 -- so 15Nov1950 is internally stored as -3334 in a SAS dataset. As you can see, internally, the number is just a number - -it only becomes meaningful as a date when you
1) use it with a SAS date function that knows what the number represents or
2) display it with a SAS date format.

But, internally, your date is stored as a single number, no matter how you read it in, no matter what format you use. That single number is what is used in comparisons and conditions. If you need a date-related piece out of that single number, you must use a SAS function to get that piece.

Consider THIS program and data:
[pre]
data what_are_dates;
infile datalines;
input name $ bday_orig : mmddyy10.;

bday_unfmt = bday_orig;
bday_diff = bday_orig;

** Extract some date information from the numeric variable;
mm = month(bday_orig);
dd = day(bday_orig);
year = year(bday_orig);
week = week(bday_orig);
quarter = qtr(bday_orig);

putlog ' *** *** *** *** *** *** *** ';
putlog _all_;
putlog name= bday_unfmt= bday_orig= mmddyy10. bday_diff= date9. mm= dd= year= week= quarter=;
return;
datalines;
alan 11/15/1950
barb 03/30/2000
carl 05/30/2000
dana 07/31/2001
edna 11/15/2010
;
run;
[/pre]

Note that the dates, in my input data are understandable as regular date values. But when I read the dates in my program with the mmddyy10. informat, I am instructing SAS to convert the dates from their conventional date representation into the internal number that represents that date's "distance" from January 1, 1960.

Here's the LOG from the above program:
[pre]
2041 data what_are_dates;
2042 infile datalines;
2043 input name $ bday_orig : mmddyy10.;
2044 bday_unfmt = bday_orig;
2045 bday_diff = bday_orig;
2046 mm = month(bday_orig);
2047 dd = day(bday_orig);
2048 year = year(bday_orig);
2049 week = week(bday_orig);
2050 quarter = qtr(bday_orig);
2051 putlog ' *** *** *** *** *** *** *** ';
2052 putlog _all_;
2053 putlog name= bday_unfmt= bday_orig= mmddyy10. bday_diff= date9. mm= dd= year= week= quarter=;
2054 return;
2055 datalines;

*** *** *** *** *** *** ***
name=alan bday_orig=-3334 bday_unfmt=-3334 bday_diff=-3334 mm=11 dd=15 year=1950 week=46 quarter=4 _ERROR_=0 _N_=1
name=alan bday_unfmt=-3334 bday_orig=11/15/1950 bday_diff=15NOV1950 mm=11 dd=15 year=1950 week=46 quarter=4
*** *** *** *** *** *** ***
name=barb bday_orig=14699 bday_unfmt=14699 bday_diff=14699 mm=3 dd=30 year=2000 week=13 quarter=1 _ERROR_=0 _N_=2
name=barb bday_unfmt=14699 bday_orig=03/30/2000 bday_diff=30MAR2000 mm=3 dd=30 year=2000 week=13 quarter=1
*** *** *** *** *** *** ***
name=carl bday_orig=14760 bday_unfmt=14760 bday_diff=14760 mm=5 dd=30 year=2000 week=22 quarter=2 _ERROR_=0 _N_=3
name=carl bday_unfmt=14760 bday_orig=05/30/2000 bday_diff=30MAY2000 mm=5 dd=30 year=2000 week=22 quarter=2
*** *** *** *** *** *** ***
name=dana bday_orig=15187 bday_unfmt=15187 bday_diff=15187 mm=7 dd=31 year=2001 week=30 quarter=3 _ERROR_=0 _N_=4
name=dana bday_unfmt=15187 bday_orig=07/31/2001 bday_diff=31JUL2001 mm=7 dd=31 year=2001 week=30 quarter=3
*** *** *** *** *** *** ***
name=edna bday_orig=18581 bday_unfmt=18581 bday_diff=18581 mm=11 dd=15 year=2010 week=46 quarter=4 _ERROR_=0 _N_=5
name=edna bday_unfmt=18581 bday_orig=11/15/2010 bday_diff=15NOV2010 mm=11 dd=15 year=2010 week=46 quarter=4
NOTE: The data set WORK.WHAT_ARE_DATES has 5 observations and 9 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
[/pre]

From the PUTLOG statements, you can see that 30Mar2000 is internally stored as the single number 14699 and that 30May2000 is internally stored as the single number 14760. That's how SAS tells the 2 dates apart. But, when the number is stored as 14699 or 14760, SAS knows that it is a date, but a LAST.BDAY condition would be set to 1 for every date (if I used LAST.BDAY in the program, which I did NOT).

Note how the use of the date-related functions MONTH, DAY, YEAR, WEEK and QTR resulted in the extraction of "date" pieces from the program.

A PROC PRINT on the dataset can use different formats or no formats to display the internally stored number, but no matter what the number looks like in the PROC PRINT output -- all the date fields are INTERNALLY stored as a single number:
Code:
[pre]
proc print data=what_are_dates;
title 'Show formatted values for some date fields';
format bday_orig weekdate. bday_diff date9.;
run;
[/pre]

Output:
[pre]
Show formatted values for some date fields

bday_
Obs name bday_orig unfmt bday_diff mm dd year week quarter

1 alan Wednesday, November 15, 1950 -3334 15NOV1950 11 15 1950 46 4
2 barb Thursday, March 30, 2000 14699 30MAR2000 3 30 2000 13 1
3 carl Tuesday, May 30, 2000 14760 30MAY2000 5 30 2000 22 2
4 dana Tuesday, July 31, 2001 15187 31JUL2001 7 31 2001 30 3
5 edna Monday, November 15, 2010 18581 15NOV2010 11 15 2010 46 4
[/pre]

Now, let's look at BY group processing on the above data First we have to sort:
[pre]
proc sort data=what_are_dates;
by descending year descending bday_orig ;
run;

proc print data=what_are_dates;
title 'After sorting -- use different formats for dates';
** null FORMAT stmt removes all formats for display purposes;
format bday_orig ddmmyy10. bday_diff date9.;
run;
[/pre]

The newly sorted data (in descending order from highest to lowest) is:
[pre]
After sorting -- use different formats for dates

bday_
Obs name bday_orig unfmt bday_diff mm dd year week quarter

1 edna 15/11/2010 18581 15NOV2010 11 15 2010 46 4
2 dana 31/07/2001 15187 31JUL2001 7 31 2001 30 3
3 carl 30/05/2000 14760 30MAY2000 5 30 2000 22 2
4 barb 30/03/2000 14699 30MAR2000 3 30 2000 13 1
5 alan 15/11/1950 -3334 15NOV1950 11 15 1950 46 4
[/pre]

OK..now we'll use FIRST. and LAST. with the above sorted data in this program:
[pre]
2202
2203 data ck_first_last;
2204 set what_are_dates;
2205 by descending year descending bday_orig;
2206 first_bd = first.bday_orig;
2207 last_bd = last.bday_orig;
2208 first_yr = first.year;
2209 last_yr = last.year;
2210
2211 putlog _n_= name= bday_orig= first_bd= last_bd= year= first_yr= last_yr=;
2212 run;

_N_=1 name=edna bday_orig=18581 first_bd=1 last_bd=1 year=2010 first_yr=1 last_yr=1
_N_=2 name=dana bday_orig=15187 first_bd=1 last_bd=1 year=2001 first_yr=1 last_yr=1
_N_=3 name=carl bday_orig=14760 first_bd=1 last_bd=1 year=2000 first_yr=1 last_yr=0
_N_=4 name=barb bday_orig=14699 first_bd=1 last_bd=1 year=2000 first_yr=0 last_yr=1
_N_=5 name=alan bday_orig=-3334 first_bd=1 last_bd=1 year=1950 first_yr=1 last_yr=1
NOTE: There were 5 observations read from the data set WORK.WHAT_ARE_DATES.
NOTE: The data set WORK.CK_FIRST_LAST has 5 observations and 13 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
[/pre]

In the above program, BDAY_ORIG -- is showing the internally stored number for birthdate and the automatic variable LAST.BDAY_ORIG (captured into the LAST_BD variable) is set to 1 on EVERY row.

So that is a LONG answer to your question. The short answer to your question
- if SAS can distinguish between march vs may in 30/3/2000 and 30/5/2000, why the code using rep_date in last.rep_date not work?

is that SAS ONLY distinguishes between 14699 and 14760 internally -- it is your format that gives "date-meaning" to the numbers. When you used the YEAR function, you told SAS to figure out the date for the internally stored number and to extract the 4 digit year from the date into a separate variable.

cynthia
Super Contributor
Super Contributor
Posts: 365

Re: A Proc Sort Mystery. Please help!

Hello, SmilingMelbourne,

"I still don't understand why:
- if year=year(rep_date) works, i.e. it can tell the difference between e.g. 30/3/2000 and 30/5/2000 , then it means that SAS knows that these 2 dates have are different in months (march vs may) and it can sort in descending order or so"

In reality SAS does NOT distinguish between these dates. It selects the one that happens to be the LAST in your initial dataset for the current sort group. If I placed these dates in the order:
30/3/2000
30/5/2000
then I got 30/5/2000 as the result. If I exchanged their places I got 30/3/2000.

If you would like to select the LAST date of the year you have to make a bit change in your program:
[pre]
data i;
input owner_id security_id shr_held value_held rep_date ddmmyy10.;
format rep_date date7.;
datalines;
2000004 29191 0 0 31/03/2004
2000004 113563 666642 2266583 31/12/2004
2000004 113563 666642 2119922 31/03/2005
2000004 113563 0 0 31/05/2005
2000004 113571 80801 1260382 30/09/2005
2000004 113571 209629 3136985 31/12/2005
2000004 113571 87283 1443286 31/03/2006
2000004 113571 91624 1555180 30/06/2006
2000004 113571 89039 1493031 30/09/2006
2000004 113571 0 0 31/12/2006
2000004 113571 0 0 31/05/2000
2000004 113571 0 0 31/03/2000
;
run;
data mf;
set i;
year=year(rep_If you would like to date);
run;
proc sort data=mf;
by owner_id security_id year REP_DATE;
data nodup;
set mf;
by owner_id security_id year;
if last.year;
run;
[/pre]
Look at additional sorting by rep_date.
Sincerely,
SPR
Super User
Posts: 9,676

Re: A Proc Sort Mystery. Please help!

Hi.
Because your highlighted blod two observations has two different date value,
So SAS treat them as two different group.You use Year() to make them have the same value,so can work.
You need data such as:
B 4545 30/3/2000
B 4545 30/3/2000
Valued Guide
Posts: 632

Re: A Proc Sort Mystery. Please help!

BTW, in the dataset report_date is already a date format (not numeric or anything)

I am a bit worried about this comment in the OP. I suspect that there is a missunderstanding about what is a SAS date, which IS a numeric value. Hopefully you are using SAS dates and not a character string - which can have sorting issues for forms other than yy/mm/dd.
Contributor
Posts: 72

Re: A Proc Sort Mystery. Please help!

Sorry for this confusion. Report_date is a numeric but formatted ddmmyy10.. That's why I could use YEAR() with it to extract the year as in the first program.

Thanks
SAS Super FREQ
Posts: 8,743

Re: A Proc Sort Mystery. Please help!

Hi:
SAS dates are internally stored as number which represents an offset from 1/1/1960 -- so the FORMAT of the internally stored number has no impact on why the YEAR function worked. As long as REPORT_DATE represented a number that was a valid date value, the YEAR function was OK.

It seems to me that SPR and many others have pointed out that LAST.REPORT_DATE would never work correctly as the logic to get your desired rows output. But in order to understand WHY that is the case, it might be useful for you to run the program below and examine the values for all the FIRST. and LAST. variables that were set when you used BY group processing.

If you examine the output, you will see that LAST.REPORT_DATE is set to 1 for EVERY row in your input data. So LAST.REPORT_DATE as an output condition would never get you the desired rows. And even though you consider the dates to be different -inside- the group, as far as BY group processing is concerned, EVERY SINGLE DATE is the LAST.REPORT_DATE (or the automatic variable LAST.REPORT_DATE = 1).

Let's take some slightly different data to consider:
[pre]
FUND FIRM REPORT_DATE
A B 30/5/2000
A B 31/12/2000 <-- you want this
A B 30/3/2001
A B 30/3/2001 <---note duplicate report date
A B 30/6/2001
A B 31/12/2001 <-- you want this
[/pre]

Now, let's look at the value of LAST.REPORT_DATE for ONLY those rows of data (from the SAS log):
[pre]
_N_=1 Fund_ID=A Firm_ID=B Report_Date=30MAY2000 FIRST.Report_Date=1 LAST.Report_Date=1
_N_=2 Fund_ID=A Firm_ID=B Report_Date=31DEC2000 FIRST.Report_Date=1 LAST.Report_Date=1
_N_=3 Fund_ID=A Firm_ID=B Report_Date=30MAR2001 FIRST.Report_Date=1 LAST.Report_Date=0
_N_=4 Fund_ID=A Firm_ID=B Report_Date=30MAR2001 FIRST.Report_Date=0 LAST.Report_Date=1
_N_=5 Fund_ID=A Firm_ID=B Report_Date=30JUN2001 FIRST.Report_Date=1 LAST.Report_Date=1
_N_=6 Fund_ID=A Firm_ID=B Report_Date=31DEC2001 FIRST.Report_Date=1 LAST.Report_Date=1
[/pre]

Note how LAST.REPORT_DATE = 1 for all the observations EXCEPT for #3, the first obs for 03/30/2001. When you coded IF LAST.REPORT_DATE, you were asking SAS to output for every obs where LAST.REPORT_DATE=1 -- this is why LAST.REPORT_DATE logic did not give you the desired rows.

Now, let's look at the YEAR for every obs (shown here as the R_YEAR variable same obs as above shown in SAS log with R_YEAR automatic variables):
[pre]
_N_=1 Fund_ID=A Firm_ID=B r_year=2000 FIRST.r_year=1 LAST.r_year=0
_N_=2 Fund_ID=A Firm_ID=B r_year=2000 FIRST.r_year=0 LAST.r_year=1
_N_=3 Fund_ID=A Firm_ID=B r_year=2001 FIRST.r_year=1 LAST.r_year=0
_N_=4 Fund_ID=A Firm_ID=B r_year=2001 FIRST.r_year=0 LAST.r_year=0
_N_=5 Fund_ID=A Firm_ID=B r_year=2001 FIRST.r_year=0 LAST.r_year=0
_N_=6 Fund_ID=A Firm_ID=B r_year=2001 FIRST.r_year=0 LAST.r_year=1
[/pre]

Note in the above log output how LAST.R_YEAR is set to 1 for obs #2 and obs #6 -- your desired rows for output.

When you use BY group processing and the automatic variables FIRST.byvar and LAST.byvar, you must always keep in mind that the entire GROUP of rows is treated as a unit -- so the FIRST. and LAST. conditions apply to the WHOLE unit -- that's why you needed a finer level of granularity (year-level) within the group in order to select the rows you wanted.

cynthia
[pre]
** Make some data;
data funds;
infile datalines dlm = ' ';
input Fund_ID $ Firm_ID $ Report_Date : anydtdte.;

** create a report year (r_year) variable;
r_year = year(report_date);
format report_date date9.;
return;
datalines;
A B 30/5/2000
A B 31/12/2000
A B 30/3/2001
A B 30/3/2001
A B 30/6/2001
A B 31/12/2001
;
run;

proc sort data=funds;
by fund_id firm_id r_year report_date;
run;

** Create multiple datasets from this initial data;
** using LAST. logic to selectively output rows;
** but first, need to "grab" the values for all the automatic variables.;
data what_are_auto_vars
last_yr_only
last_rd_only
last_firm_only;
set funds;
by fund_id firm_id r_year report_date;

** Capture the values of ALL the automatic variables;
** to see how SAS is setting the FIRST. and LAST. values.;
first_fnd = first.fund_id;
last_fnd = last.fund_id;
first_frm = first.firm_id;
last_frm = last.firm_id;
first_yr = first.r_year;
last_yr = last.r_year;
first_rd = first.report_date;
last_rd = last.report_date;


putlog _n_= fund_id= firm_id= report_date= first.report_date= last.report_date=;
*putlog _n_= fund_id= firm_id= r_year= first.r_year= last.r_year=;

** now, output everything and then;
** output for most of the other LAST. variables;
output what_are_auto_vars;
if last.firm_id then output last_firm_only;
if last.report_date then output last_rd_only;
if last.r_year then output last_yr_only;
run;

ods listing close;
ods html file='c:\temp\understand_first_last.html' style=sasweb;
proc report data=what_are_auto_vars nowd;
title 'What are Automatic Variables with BY Processing?';
column ('FUND FIRST. and LAST.' fund_id first_fnd last_fnd)
('FIRM FIRST. and LAST.' firm_id first_frm last_frm)
('YEAR FIRST. and LAST.' r_year first_yr last_yr)
('DATE FIRST. and LAST.' report_date first_rd last_rd)
('Internal Number' report_date=unfmt_date);
define report_date / f=date9.;
define unfmt_date / f=best6.;
run;

proc report data=last_yr_only nowd;

title 'What is output for last.r_year -- matches desired output';
column ('FUND FIRST. and LAST.' fund_id first_fnd last_fnd)
('FIRM FIRST. and LAST.' firm_id first_frm last_frm)
('YEAR FIRST. and LAST.' r_year first_yr last_yr)
('DATE FIRST. and LAST.' report_date first_rd last_rd);
run;

proc report data=last_rd_only nowd;
title 'What is output for last.report_date (wrong)';
column ('FUND FIRST. and LAST.' fund_id first_fnd last_fnd)
('FIRM FIRST. and LAST.' firm_id first_frm last_frm)
('YEAR FIRST. and LAST.' r_year first_yr last_yr)
('DATE FIRST. and LAST.' report_date first_rd last_rd);
run;

proc report data=last_firm_only nowd;
title 'What is output for last.firm_id (wrong)';
column ('FUND FIRST. and LAST.' fund_id first_fnd last_fnd)
('FIRM FIRST. and LAST.' firm_id first_frm last_frm)
('YEAR FIRST. and LAST.' r_year first_yr last_yr)
('DATE FIRST. and LAST.' report_date first_rd last_rd);
run;
title;
ods _all_ close;
[/pre]
Ask a Question
Discussion stats
  • 16 replies
  • 178 views
  • 0 likes
  • 8 in conversation