Hello,
From long time, if I want to group by one ID field, then keep the first record by second var, I do this:
"proc sort data=inputx out=a; by ID month; data b ; set a; by id month; if first.ID; run;"
That has always given me the first, earliest month, to represent the ID. Call it 'practice X'.
Today I saw (still inheriting others' code) below (call it "practice Y")
"proc sort data=inputx out=a; by ID month;
proc sort data=a out=c nodupkey; by ID ; run;"
I vaguely remember when I read (sugi?) papers to adopt practice X, the motivation argued then was because practice Y does not guarantee the first month will be kept (in most cases it does? ).
So I sort both datasets c and b by ID, month. Then :
"data compx; merge c(in=c) b(in=b); by ID month; if c and b then matched=1; else matched=2;
run;"
The total count ID from the two inputs are identical, should be. AND--- Matched ALL=1, according to proc freq. In this comparison. Certainly I have not worked with every single release /version variation of BASE SAS in the past. Could be at certain point of BASE evolution, practice Y was built to guarantee==practice X? I can stick with practice X and it always does as desired pretending not seeing practice Y. Practice Y, though, is the third time I have seen in one place. I audited one today. Was it coincidence that it matched practice X, or BASE has evolved that practice Y should be generally considered equivalent to practice X? Thanks. Jia
The two methods are the same. Both order the data by ID and MONTH and then select just the first observation per ID.
The only reason they wouldn't would be if you added the NOEQUALS option to the last PROC SORT.
EQUALS | NOEQUALS
specifies the order of the observations in the output data set. For observations with identical BY-variable values, EQUALS maintains the relative order of the observations within the input data set in the output data set. NOEQUALS does not necessarily preserve this order in the output data set.
Default EQUALS Interactions When you use NODUPKEY to remove observations in the output data set, the choice of EQUALS or NOEQUALS can affect which observations are removed. The EQUALS | NOEQUALS procedure option overrides the default sort stability behavior that is established with the SORTEQUALS | NOSORTEQUALS system option. The EQUALS option is supported by the threaded sort. However, I/O performance might be reduced when using the EQUALS option with the threaded sort because partitioned data sets are processed as if they consist of a single partition. The NOEQUALS option is supported by the threaded sort. The order of observations within BY groups that are returned by the threaded sort might not be consistent between runs. Tip Using NOEQUALS can save CPU time and memory.
The main issue with NODUPKEY, as I remember it, is when you want a specific value of a variable that does not appear in the sort BY order when there are multiple values of all the BY variables.
When you are sorting by ID and Month and that is all you look it then the order, because you have them on the BY statement. The only reason the "first month" wouldn't appear would be if 1) month is Character and 2) you expect month to appear in chronological order of month names. If your month is numeric then you will get the lowest numeric value for month for each ID.
You might consider running this code and compare the two resulting data sets Sorted1 and Sorted2. It is very likely that the value of the C variable, which is not on the sort, is different between the sort results. This demonstrates that the input order can have an impact on the result when using NODUPKEY for variables not on the BY statement.
data example; input a b c; r=rand('uniform'); datalines; 1 1 5 1 1 2 1 1 27 2 2 27 2 2 2 2 2 5 ; proc sort data=example out=sorted1 nodupkey ; by a b; run; proc sort data=example; by r; run; proc sort data=example out=sorted2 nodupkey ; by a b; run;
So, Practice Y originates because of an option that used to exist (NODUP/NODUPRECS) option that requires sorted data, otherwise it won't remove the duplicates if they're not back to back (post on the topic). NODUPKEY does remove the duplicates in the BY variables regardless of position.
In theory, Practice Y and X should be identical, the case where they differ is if someone (hands up) has accidentally forgotten to double sort and uses NODUP. SUGI papers are from 2005 and earlier I think so you really need to factor that in sometimes, SAS is older than most data scientists.
Wikipedia has in part:
In 1985, in a lecture given to the Chinese Academy of Sciences in Beijing, C.F. Jeff Wu used the term Data Science for the first time as an alternative name for statistics.
which means that SAS is older than the phrase "data science" in this context.
I was taking classes using SAS 5 in 1985...
The two methods are the same. Both order the data by ID and MONTH and then select just the first observation per ID.
The only reason they wouldn't would be if you added the NOEQUALS option to the last PROC SORT.
EQUALS | NOEQUALS
specifies the order of the observations in the output data set. For observations with identical BY-variable values, EQUALS maintains the relative order of the observations within the input data set in the output data set. NOEQUALS does not necessarily preserve this order in the output data set.
Default EQUALS Interactions When you use NODUPKEY to remove observations in the output data set, the choice of EQUALS or NOEQUALS can affect which observations are removed. The EQUALS | NOEQUALS procedure option overrides the default sort stability behavior that is established with the SORTEQUALS | NOSORTEQUALS system option. The EQUALS option is supported by the threaded sort. However, I/O performance might be reduced when using the EQUALS option with the threaded sort because partitioned data sets are processed as if they consist of a single partition. The NOEQUALS option is supported by the threaded sort. The order of observations within BY groups that are returned by the threaded sort might not be consistent between runs. Tip Using NOEQUALS can save CPU time and memory.
Thank you all. I will just stick with my first.ID, practice X. When it does not offend, I will replace Y with X. It is not hard to do and takes away unnecessary discussion. Since nobody seems to object X is straight, clear and purpose-specific. Practice Y, while it may coincide with X in almost all cases, sounds a bit interpretive (like Python), among a 'hidden button' and other things.
Ballardw's comment led me to move up ~20 lines to confirm why in this case X==Y. Reeza's comment, on the other hand, does stir up some old memory, running SAS on MF vs Unix, $char vs. $, ....... It is true if Y fails on some formats on the second var, the chance is X will fail as well; relative, physical order often screws the same way regardless X or Y. Now I do recall cases where EQUALS and NOEQUALS do not work, therefore Y may not work as expected, but X does not have that dependency. Action by design, sorting is not supposed to be used to pin down and select, same logic underpinning why we often do not build clustering solution to rank order records.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.