I am currently working on a dataset that has duplicates in the primary key as shown in the example table I have given below.
PROC SORT DATA = SAMPLE_TABLE NODUPKEY;
BY ID_NO;
RUN;
Dropping duplicates using the above snippet resulted in values being lost however. I would very much appreciate some advice on how to drop duplicates and preserve non-null values.
Hello @danielchoi626
This can be done as shown below. Modify as needed.
I have used your excel as the basis. In your excel had amount as character. So I have added code to convert it to number.
/*Import your excel sheet */
PROC IMPORT DATAFILE='/home/yourfolder/SAMPLE DATA.xlsx'
DBMS=XLSX
OUT=WORK.IMPORT replace;
GETNAMES=YES;
RUN;
/* Your amount was character.
I am converting the amount to number. If in your original dataset amount is numeric, then this step not needed */
data import;
set import (Rename=(Amount=Amount_));
Amount=input(Amount_,best12.);
drop amount_;
run;
/* Remove duplicates*/
proc sql;
select distinct a.id_no, b.appl_no,b.Amount from import as a
left join import as b
on a.ID_no=b.ID_no
where b.amount is not missing
order by ID_no;
quit;
The output will be as you wanted. The not null values from amount have been retained with no duplicates
Let me know if there are questions.
In a PROC SORT that does change more than just the order, it is strongly recommended to use the OUT= option to create a new dataset and leave the source dataset untouched.
Additionally, you can use the DUPOUT= option to store the filtered observations.
Sorry, but from your description it is not clear what you want to keep and what to drop. If you want to drop all obs having a missing value in id_no, then use a where statement.
It's relatively easy:
proc sort data=sample_table;
by id_no;
run;
data want;
update sample_table (obs=0) sample_table;
by id_no;
run;
For each ID_NO, this technique will use the last non-missing value for each variable. Note, however, that you may still lose data. It is conceivable that a variable has two different non-missing values in different observations.
Hello @danielchoi626
This can be done as shown below. Modify as needed.
I have used your excel as the basis. In your excel had amount as character. So I have added code to convert it to number.
/*Import your excel sheet */
PROC IMPORT DATAFILE='/home/yourfolder/SAMPLE DATA.xlsx'
DBMS=XLSX
OUT=WORK.IMPORT replace;
GETNAMES=YES;
RUN;
/* Your amount was character.
I am converting the amount to number. If in your original dataset amount is numeric, then this step not needed */
data import;
set import (Rename=(Amount=Amount_));
Amount=input(Amount_,best12.);
drop amount_;
run;
/* Remove duplicates*/
proc sql;
select distinct a.id_no, b.appl_no,b.Amount from import as a
left join import as b
on a.ID_no=b.ID_no
where b.amount is not missing
order by ID_no;
quit;
The output will be as you wanted. The not null values from amount have been retained with no duplicates
Let me know if there are questions.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Lock in the best rate now before the price increases on April 1.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.