I've got SAS code that extracts data from a PeopleSoft environment (Oracle). It was written years ago, and I have to decipher it without any documentation and move it into a PeopleCode Application Engine program using SQL steps.
One piece of code clearly shows the developer finding duplicates using the following key.
STDNT_KEY = EMPLID || STRM || CLASS_NBR;
DATA DUPS_CDUP_ENRL;
SET CLS_CDUP_ENRL;
BY STDNT_KEY;
IF (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) OR
(FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0) OR
(FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1);
KEEP OWNER EMPLID STRM SESSION_CODE SUBJECT CATALOG_NBR CLASS_SECTION DUP_CODE;
RUN;
PROC EXPORT DATA=DUPS_CDUP_ENRL
OUTFILE= "C:\Processes\C_DUP\for_1058\DUPLICATE_CDUP_ENTRIES.xls"
DBMS=EXCEL2000 REPLACE;
RUN;
In my understanding if FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 1 then he would not have any duplicates, so he appears to only look for any row that is a duplicate. True?
Again...I have no business rules or documentation so I have to try and understand the intent of the developer via the code.
Later I see the following code...
IF (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0) OR
(FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) THEN DELETE;
OUTPUT CDUP_TRANS;
RUN;
PROC EXPORT DATA=CDUP_TRANS
OUTFILE= "C:\Processes\C_DUP\for_1058\CDUP_TRANS.xls"
DBMS=EXCEL2000 REPLACE;
RUN;
I don't know what CDUP_TRANS output is supposed to represent and don't know if the developer was only looking for dups STDNT_KEY = 1 for both first and last. If so why did they exclude (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) from the CDUP_TRANS code?
Is
(FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) OR
(FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1)
redundant?
In my understanding if FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 1 then he would not have any duplicates, so he appears to only look for any row that is a duplicate. True?
That is true. He is keeping the rows for every STDNT_KEY that has more than one entry.
I don't know what CDUP_TRANS output is supposed to represent and don't know if the developer was only looking for dups STDNT_KEY = 1 for both first and last. If so why did they exclude (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) from the CDUP_TRANS code?
In this code he reduces each key that has more than one entry to its first entry only, presumably for summarizing or reporting.
But the first part of your code is not complete. So it could be that he is reporting every STDNT_KEY in the database table without duplicates. In that case a key with more than one entry is represented by its first record, and a key with a single entry is represented by that one record.
Is (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) OR
(FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1)
redundant?
No.
(FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... means the first entry in a group with more than one entry
(FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) ... means the last entry in a group with more than one entry
*****
For example:
STDNT_KEY
1
2
3
1
1
3
After the data is sorted by STDNT_KEY:
STDNT_KEY
1
1
1
2
3
3
The temporary SAS variables FIRST.STDNT_KEY and LAST.STDNT_KEY mark the boundaries of each BY group in the sorted data, where each distinct STDNT_KEY value forms a group of its own:
STDNT_KEY
1 (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... the first entry in the 1's group
1 (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0) ... not the first or the last
1 (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) ... the last entry in the 1's group
2 (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 1) ... the first entry in the 2's group and the last
3 (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... the first entry in the 3's group
3 (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) ... the last entry in the 3's group
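The flags listed above can be inspected directly. This is a minimal sketch, assuming a made-up data set named SAMPLE that holds only the six example keys; IS_FIRST and IS_LAST are hypothetical variable names used because FIRST. and LAST. variables are temporary and are not written to the output data set.

```sas
/* Hypothetical sample data: the six keys from the example above */
data sample;
  input STDNT_KEY;
  datalines;
1
2
3
1
1
3
;
run;

/* BY-group processing expects the data to be sorted by the BY variable */
proc sort data=sample;
  by STDNT_KEY;
run;

/* Copy the temporary FIRST./LAST. flags into real variables so they can be printed */
data flags;
  set sample;
  by STDNT_KEY;
  IS_FIRST = first.STDNT_KEY;
  IS_LAST  = last.STDNT_KEY;
run;

proc print data=flags noobs;
run;
```

The printed IS_FIRST and IS_LAST columns should match the six annotated rows above, with only key 2 having both flags equal to 1.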
*********
So
IF (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) OR
(FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0) OR
(FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) THEN OUTPUT;
gives you every row except the one where both flags are 1 (the single row for key 2):
STDNT_KEY
1 (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... the first entry in the 1's group
1 (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0) ... not the first or the last
1 (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) ... the last entry in the 1's group
3 (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... the first entry in the 3's group
3 (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) ... the last entry in the 3's group
*******
And
IF (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0) OR
(FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) THEN DELETE;
gives you, when run against the full sorted data above, only the rows where FIRST.STDNT_KEY = 1, that is, one row per key:
STDNT_KEY
1 (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... the first entry in the 1's group
2 (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 1) ... the first entry in the 2's group and the last
3 (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... the first entry in the 3's group
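To compare the two filters from this thread on the same input, both can be written in one DATA step. This is only a sketch under assumptions: SAMPLE, DUP_ROWS and FIRST_ROWS are made-up data set names, and the simplified conditions are logically equivalent rewrites of the IF statements quoted above, not the original developer's code.

```sas
/* Hypothetical sample data, already in STDNT_KEY order */
data sample;
  input STDNT_KEY;
  datalines;
1
1
1
2
3
3
;
run;

data dup_rows first_rows;
  set sample;
  by STDNT_KEY;

  /* Same as the three-way OR: every row except a key's only row */
  if not (first.STDNT_KEY and last.STDNT_KEY) then output dup_rows;

  /* Same as deleting rows where FIRST.STDNT_KEY = 0: first row of each key */
  if first.STDNT_KEY then output first_rows;
run;
```

With this input, DUP_ROWS receives the five rows for keys 1 and 3, and FIRST_ROWS receives one row each for keys 1, 2 and 3. If the original CDUP_TRANS step read from the already filtered DUPS_CDUP_ENRL instead of the full table, key 2 would not appear in its output.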
Hi.
How about you try this:
IF FIRST.STDNT_KEY = 1 and LAST.STDNT_KEY = 1 then delete;
else output;
This way, records whose key is unique (FIRST and LAST both = 1) are deleted from the data table, and only the records with a duplicated key are kept.
Being a bit pedantic, I would say "records where the key is duplicated", not "duplicate records".
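Since the goal is to move this logic into SQL steps, the "key is duplicated" test can also be expressed without FIRST. and LAST. variables. The following is a rough sketch in PROC SQL, assuming a made-up SAMPLE data set and hypothetical names DUP_KEYS, DUP_ROWS and N_ROWS; the real step would read the enrollment table and the concatenated key instead.

```sas
/* Hypothetical sample data reusing the six keys from the thread's example */
data sample;
  input STDNT_KEY;
  datalines;
1
2
3
1
1
3
;
run;

proc sql;
  /* One row per duplicated key, with the number of rows sharing it */
  create table dup_keys as
  select STDNT_KEY, count(*) as N_ROWS
  from sample
  group by STDNT_KEY
  having count(*) > 1;

  /* Every row whose key is duplicated; no sort order is required */
  create table dup_rows as
  select s.*
  from sample as s
  where s.STDNT_KEY in (select STDNT_KEY from dup_keys);
quit;
```

Here DUP_KEYS lists keys 1 and 3, and DUP_ROWS holds their five rows. Keeping only "the first row" per key, as the DELETE version does, depends on row order, so a SQL port of that step would need an explicit ordering column, which this thread does not identify.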