Hi Experts,
I have a dataset which has repeated reference_number. I do not want to delete anything but just want to add a new column which can show 1 if the reference_number is repeaded or else 0. Can you please suggest how can I do that. I have used the below code. This highlight the first time reference_number as 0 and the repeated reference_number as 1. I want if the reference_number is not repeated then 0 and if repeated then 1.
data repeated_accounts;
set Equifax_files;
if lag(reference_number)=reference_number then repeat_flag=1;
else repeat_flag=0;
run;
If you want a flag that indicates if the reference number is UNIQUE (appears only in one observation) and the data is sorted then you need to test both the FIRST. and the LAST. flags.
So this data step will create a flag variable UNIQUE that will be 1 (TRUE) when this is the only observation with that value and 0 (FALSE) when it is any of multiple observations with the same value.
data want;
set have;
by ref_no;
unique = first.ref_no and last.ref_no ;
run;
It sounds like your data is sorted by reference_number. If I'm understanding your goal, you can do it like (untested):
data repeated_accounts;
set Equifax_files;
by reference_number ;
flag = NOT (first.reference_number and last.reference_number) ;
run;
Try this:
/* data */
data Equifax_files;
input reference_number $ 1. @@;
if reference_number;
cards;
123456673399822267234
;
run;
proc print;
run;
/*process*/
proc sort data=Equifax_files;
by reference_number;
run;
data Equifax_files;
set Equifax_files;
by reference_number;
marker = not first.reference_number;
run;
proc print;
run;
Bart
@Tom 's answer seems to be doing the job, i.e.:
marker = not (first.reference_number and last.reference_number);
Bart
Example before and result desired.
Since your LAG attempt, which doesn't behave as you want because of the nature of LAG, would only have a chance of working if the reference is sorted:
data example; input ref; datalines; 1 1 1 2 3 3 4 ; data want; set example; by ref; flag= not(first.ref); run;
LAG, and DIF, are queue functions. So when you use IF lag() the "lagged" value is the last time the IF was true, not the previous record.
To use Lag you would do something like:
data repeated_accounts; set Equifax_files; Lref = lag(Reference_number); if lfref=reference_number then repeat_flag=1; else repeat_flag=0; drop lref; run;
If you want a flag that indicates if the reference number is UNIQUE (appears only in one observation) and the data is sorted then you need to test both the FIRST. and the LAST. flags.
So this data step will create a flag variable UNIQUE that will be 1 (TRUE) when this is the only observation with that value and 0 (FALSE) when it is any of multiple observations with the same value.
data want;
set have;
by ref_no;
unique = first.ref_no and last.ref_no ;
run;
Thank you all!
@Sandeep77 wrote:
Thank you but it still shows as the first reference_number as 0 and the repeated reference_number as 1. I want if the reference_number is repeated then put it as 1 or else 0. So that I can filter out 0 and conclude them as they are not repeated and unique reference_numbers from the data.
I submit that your descriptions need to include example DATA and desired result.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.