BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
danielchoi626
Calcite | Level 5

I am currently working on a dataset that has duplicates in the primary key as shown in the example table I have given below.

PROC SORT DATA = SAMPLE_TABLE NODUPKEY;
BY ID_NO;
RUN;

Dropping duplicates using the above snippet resulted in values being lost however. I would very much appreciate some advice on how to drop duplicates and preserve non-null values.

 

1 ACCEPTED SOLUTION

Accepted Solutions
Sajid01
Meteorite | Level 14

Hello @danielchoi626 
This can be done as shown below. Modify as needed.
I have used your excel as the basis. In your  excel had amount as character. So I have added code to convert it to number.

/*Import your excel sheet */

PROC IMPORT DATAFILE='/home/yourfolder/SAMPLE DATA.xlsx'
	DBMS=XLSX
	OUT=WORK.IMPORT replace;
	GETNAMES=YES;
RUN;
/* Your amount was character.
I am converting  the amount to number. If in your original dataset amount is numeric, then this step not needed */
data import;
set import (Rename=(Amount=Amount_));
Amount=input(Amount_,best12.);
drop amount_;
run;

/* Remove duplicates*/
proc sql;
select distinct a.id_no, b.appl_no,b.Amount from import as a
left join import as b
on a.ID_no=b.ID_no
where b.amount is not missing
order by ID_no;
quit;

The  output will be as you wanted. The not null values from amount have been retained with no duplicates

Sajid01_0-1625588364899.png

Let me know if there are questions.

 

 

View solution in original post

4 REPLIES 4
Kurt_Bremser
Super User

In a PROC SORT that does change more than just the order, it is strongly recommended to use the OUT= option to create a new dataset and leave the source dataset untouched.

Additionally, you can use the DUPOUT= option to store the filtered observations.

andreas_lds
Jade | Level 19

Sorry, but from your description it is not clear what you want to keep and what to drop. If you want to drop all obs having a missing value in id_no, then use a where statement.

Astounding
PROC Star

It's relatively easy:

proc sort data=sample_table;
   by id_no;
run;

data want;
   update sample_table (obs=0) sample_table;
   by id_no;
run;

For each ID_NO, this technique will use the last non-missing value for each variable.  Note, however, that you may still lose data.  It is conceivable that a variable has two different non-missing values in different observations.

Sajid01
Meteorite | Level 14

Hello @danielchoi626 
This can be done as shown below. Modify as needed.
I have used your excel as the basis. In your  excel had amount as character. So I have added code to convert it to number.

/*Import your excel sheet */

PROC IMPORT DATAFILE='/home/yourfolder/SAMPLE DATA.xlsx'
	DBMS=XLSX
	OUT=WORK.IMPORT replace;
	GETNAMES=YES;
RUN;
/* Your amount was character.
I am converting  the amount to number. If in your original dataset amount is numeric, then this step not needed */
data import;
set import (Rename=(Amount=Amount_));
Amount=input(Amount_,best12.);
drop amount_;
run;

/* Remove duplicates*/
proc sql;
select distinct a.id_no, b.appl_no,b.Amount from import as a
left join import as b
on a.ID_no=b.ID_no
where b.amount is not missing
order by ID_no;
quit;

The  output will be as you wanted. The not null values from amount have been retained with no duplicates

Sajid01_0-1625588364899.png

Let me know if there are questions.

 

 

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 2695 views
  • 1 like
  • 5 in conversation