I have a dataset with many duplicates for each customer ID, what I want to achieve is keeping only one record in each month for one customer. If multiple records in a month, select the record with the earliest date. How could I achieve this? I have tried first. , last. but it only gives me one cust_ID one record.
for example:
cust_id date month
12345 21APR23 15:48:30
12345 23APR23 15:50:30
12345 23APR23 17:48:30
12345 25APR23 11:48:30
12345 21Jun23 15:35:30
want:
12345 21APR23 15:48:30
12345 21Jun23 15:35:30
data Have;
input cust_id $ date datetime.;
format date datetime21. month monyy7.;
month = intnx('MONTH', datepart(date), 0, 'B');
datalines;
12345 21APR2023:15:48:30
12345 23APR2023:15:50:30
12345 23APR2023:17:48:30
12345 25APR2023:11:48:30
12345 21Jun2023:15:35:30
;
run;
proc sort data = Have;
by cust_id month date;
run;
data Want;
set Have;
by cust_id month;
if first.month;
run;
data Have;
input cust_id $ date datetime.;
format date datetime21. month monyy7.;
month = intnx('MONTH', datepart(date), 0, 'B');
datalines;
12345 21APR2023:15:48:30
12345 23APR2023:15:50:30
12345 23APR2023:17:48:30
12345 25APR2023:11:48:30
12345 21Jun2023:15:35:30
;
run;
proc sort data = Have;
by cust_id month date;
run;
data Want;
set Have;
by cust_id month;
if first.month;
run;
If your dataset has a DATE variable but doesn't already have a MONTH variable, you don't need to create MONTH:
data want;
set have;
by cust_id ;
if first.id=1 or intck('month',date,lag(date))^=0;
run;
Stealing the example data set from @SASKiwi
data Have; input cust_id $ date datetime.; format date datetime21. month monyy7.; month = intnx('MONTH', datepart(date), 0, 'B'); datalines; 12345 21APR2023:15:48:30 12345 23APR2023:15:50:30 12345 23APR2023:17:48:30 12345 25APR2023:11:48:30 12345 21Jun2023:15:35:30 ; run; proc summary data=have nway; class cust_id date; format date dtmonyy7. ; output out=want (drop=_:); run; proc print data=want; format date datetime18.; run;
When you know an appropriate format and apply it to a Class variable, such as the Dtmonyy, then summary will return the lowest numeric value in the format created group as the value. This particular summary otherwise counts observations so we drop those count variables as not wanted for this exercise.
Note: Class may not be appropriate for large data sets with many combinations of the Class variables.
Maybe give a try with a month variable and proc sort nodupkey:
data snacks; set SASHELP.snacks; month=compress(year(Date)||month(date)); run; proc sort data=snacks nodupkey;by product month;run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.