Solved: is there a way to drop duplicate rows while keeping rows that are not ...

danielchoi626 · Posted 07-06-2021 05:36 AM

I am currently working on a dataset that has duplicates in the primary key as shown in the example table I have given below.

PROC SORT DATA = SAMPLE_TABLE NODUPKEY;

BY ID_NO;

RUN;

Dropping duplicates using the above snippet resulted in values being lost however. I would very much appreciate some advice on how to drop duplicates and preserve non-null values.

Sajid01 · Posted 07-06-2021 12:22 PM

Hello @danielchoi626
This can be done as shown below. Modify as needed.
I have used your excel as the basis. In your excel had amount as character. So I have added code to convert it to number.

/*Import your excel sheet */

PROC IMPORT DATAFILE='/home/yourfolder/SAMPLE DATA.xlsx'
	DBMS=XLSX
	OUT=WORK.IMPORT replace;
	GETNAMES=YES;
RUN;
/* Your amount was character.
I am converting  the amount to number. If in your original dataset amount is numeric, then this step not needed */
data import;
set import (Rename=(Amount=Amount_));
Amount=input(Amount_,best12.);
drop amount_;
run;

/* Remove duplicates*/
proc sql;
select distinct a.id_no, b.appl_no,b.Amount from import as a
left join import as b
on a.ID_no=b.ID_no
where b.amount is not missing
order by ID_no;
quit;

The output will be as you wanted. The not null values from amount have been retained with no duplicates

Let me know if there are questions.

View solution in original post

Kurt_Bremser · Posted 07-06-2021 06:00 AM

In a PROC SORT that does change more than just the order, it is strongly recommended to use the OUT= option to create a new dataset and leave the source dataset untouched.

Additionally, you can use the DUPOUT= option to store the filtered observations.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

andreas_lds · Posted 07-06-2021 06:17 AM

Sorry, but from your description it is not clear what you want to keep and what to drop. If you want to drop all obs having a missing value in id_no, then use a where statement.

Astounding · Posted 07-06-2021 09:05 AM

It's relatively easy:

proc sort data=sample_table;
   by id_no;
run;

data want;
   update sample_table (obs=0) sample_table;
   by id_no;
run;

For each ID_NO, this technique will use the last non-missing value for each variable. Note, however, that you may still lose data. It is conceivable that a variable has two different non-missing values in different observations.

Sajid01 · Posted 07-06-2021 12:22 PM

Hello @danielchoi626
This can be done as shown below. Modify as needed.
I have used your excel as the basis. In your excel had amount as character. So I have added code to convert it to number.

/*Import your excel sheet */

PROC IMPORT DATAFILE='/home/yourfolder/SAMPLE DATA.xlsx'
	DBMS=XLSX
	OUT=WORK.IMPORT replace;
	GETNAMES=YES;
RUN;
/* Your amount was character.
I am converting  the amount to number. If in your original dataset amount is numeric, then this step not needed */
data import;
set import (Rename=(Amount=Amount_));
Amount=input(Amount_,best12.);
drop amount_;
run;

/* Remove duplicates*/
proc sql;
select distinct a.id_no, b.appl_no,b.Amount from import as a
left join import as b
on a.ID_no=b.ID_no
where b.amount is not missing
order by ID_no;
quit;

The output will be as you wanted. The not null values from amount have been retained with no duplicates

Let me know if there are questions.

is there a way to drop duplicate rows while keeping rows that are not null values?

Re: is there a way to drop duplicate rows while keeping rows that are not null values?

Re: is there a way to drop duplicate rows while keeping rows that are not null values?

Re: is there a way to drop duplicate rows while keeping rows that are not null values?

Re: is there a way to drop duplicate rows while keeping rows that are not null values?

Re: is there a way to drop duplicate rows while keeping rows that are not null values?

Registration is open

is there a way to drop duplicate rows while keeping rows that are not null values?

Re: is there a way to drop duplicate rows while keeping rows that are not null values?

Re: is there a way to drop duplicate rows while keeping rows that are not null values?

Re: is there a way to drop duplicate rows while keeping rows that are not null values?

Re: is there a way to drop duplicate rows while keeping rows that are not null values?

Re: is there a way to drop duplicate rows while keeping rows that are not null values?

Registration is open

SAS Training: Just a Click Away