Hi Experts,
I have a dataset which has repeated reference_number. I do not want to delete anything but just want to add a new column which can show 1 if the reference_number is repeaded or else 0. Can you please suggest how can I do that. I have used the below code. This highlight the first time reference_number as 0 and the repeated reference_number as 1. I want if the reference_number is not repeated then 0 and if repeated then 1.
data repeated_accounts;
set Equifax_files;
if lag(reference_number)=reference_number then repeat_flag=1;
else repeat_flag=0;
run;
If you want a flag that indicates if the reference number is UNIQUE (appears only in one observation) and the data is sorted then you need to test both the FIRST. and the LAST. flags.
So this data step will create a flag variable UNIQUE that will be 1 (TRUE) when this is the only observation with that value and 0 (FALSE) when it is any of multiple observations with the same value.
data want;
set have;
by ref_no;
unique = first.ref_no and last.ref_no ;
run;
It sounds like your data is sorted by reference_number. If I'm understanding your goal, you can do it like (untested):
data repeated_accounts;
set Equifax_files;
by reference_number ;
flag = NOT (first.reference_number and last.reference_number) ;
run;
Try this:
/* data */
data Equifax_files;
input reference_number $ 1. @@;
if reference_number;
cards;
123456673399822267234
;
run;
proc print;
run;
/*process*/
proc sort data=Equifax_files;
by reference_number;
run;
data Equifax_files;
set Equifax_files;
by reference_number;
marker = not first.reference_number;
run;
proc print;
run;
Bart
@Tom 's answer seems to be doing the job, i.e.:
marker = not (first.reference_number and last.reference_number);
Bart
Example before and result desired.
Since your LAG attempt, which doesn't behave as you want because of the nature of LAG, would only have a chance of working if the reference is sorted:
data example; input ref; datalines; 1 1 1 2 3 3 4 ; data want; set example; by ref; flag= not(first.ref); run;
LAG, and DIF, are queue functions. So when you use IF lag() the "lagged" value is the last time the IF was true, not the previous record.
To use Lag you would do something like:
data repeated_accounts; set Equifax_files; Lref = lag(Reference_number); if lfref=reference_number then repeat_flag=1; else repeat_flag=0; drop lref; run;
If you want a flag that indicates if the reference number is UNIQUE (appears only in one observation) and the data is sorted then you need to test both the FIRST. and the LAST. flags.
So this data step will create a flag variable UNIQUE that will be 1 (TRUE) when this is the only observation with that value and 0 (FALSE) when it is any of multiple observations with the same value.
data want;
set have;
by ref_no;
unique = first.ref_no and last.ref_no ;
run;
Thank you all!
@Sandeep77 wrote:
Thank you but it still shows as the first reference_number as 0 and the repeated reference_number as 1. I want if the reference_number is repeated then put it as 1 or else 0. So that I can filter out 0 and conclude them as they are not repeated and unique reference_numbers from the data.
I submit that your descriptions need to include example DATA and desired result.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
For SAS newbies, this video is a great way to get started. James Harroun walks through the process using SAS Studio for SAS OnDemand for Academics, but the same steps apply to any analytics project.
Find more tutorials on the SAS Users YouTube channel.