topic Re: How do I remove duplicates from a Proc Print Table after certain variables? in New SAS User

How do I remove duplicates from a Proc Print Table after certain variables?

ChrisH10 — Wed, 23 Oct 2019 04:14:40 GMT

Hi everyone,

So I have a 9000 line data set (work.data) that displays loans and their associated values with them. One of the values: PERIOD_END_LSTAT tells me if the payments are overdue ("In Grace Period" or "Late (31-120 days)" or "Late (16-30 days)").

Since the payments that occur more than once for the same loan, there are multiple rows of late payments (duplicates). As you can see in the screenshot, the observation 339 & 340 are the same, but I just want it to show up once.

Below, I've sorted all of the late payments from the data set, but I just want to know how many unique loans have late payments:

Proc Print Data=work.data; 
where PERIOD_END_LSTAT = "In Grace Period" 
or PERIOD_END_LSTAT = "Late (31-120 days)" 
or PERIOD_END_LSTAT = "Late (16-30 days)";
Run;

How can I remove the same loans that appear more than once? I just want it to show up once so I can count the unique loans.

Thanks!

Re: How do I remove duplicates from a Proc Print Table after certain variables?

ballardw — Mon, 21 Oct 2019 20:18:26 GMT

@ChrisH10 wrote:

Hi everyone,

So I have a 9000 line data set (work.data) that displays loans and their associated values with them. One of the values: PERIOD_END_LSTAT tells me if the payments are overdue ("In Grace Period" or "Late (31-120 days)" or "Late (16-30 days)").

Since the payments that occur more than once for the same loan, there are multiple rows of late payments (duplicates). As you can see in the screenshot, the observation 339 & 340 are the same, but I just want it to show up once.

Below, I've sorted all of the late payments from the data set, but I just want to know how many unique loans have late payments:
Proc Print Data=work.data; 
where PERIOD_END_LSTAT = "In Grace Period" 
or PERIOD_END_LSTAT = "Late (31-120 days)" 
or PERIOD_END_LSTAT = "Late (16-30 days)";
Run;
How can I remove the same loans that appear more than once? I just want it to show up once so I can count the unique loans.

Thanks! I'm very very new to SAS but need to learn this for a job offer.

Which record do you want to display? First? Last? in the middle of a series? And if a load goes from "Late (16-30 days" to "Late (31-120 days)" what do you want to show in the output?

You may also have to provide the minimum list of variables needed to uniquely identify a "loan".

Likely you will need a data step to filter the records depending on your answers to the above questions.

Re: How do I remove duplicates from a Proc Print Table after certain variables?

ChrisH10 — Mon, 21 Oct 2019 20:27:42 GMT

Well my main goal right now is to just make the table removed of all the duplicates where all of the columns are the same to the one above it (as in obs 339 and 340). I would want it to display all of the columns as shown in the screenshot. I mean the output would just need to be the removed duplicates table and a count of how many loans there are.

The minimum list of variables I have decided to identify a loan are ID, IssuedDate and loan_amnt (not shown on screenshot):

proc sort data=DATA out=SAS_Sorted_DATA nodupkey;

by id loan_amnt IssuedDate;

run;

Hopefully that helps!

Re: How do I remove duplicates from a Proc Print Table after certain variables?

mkeintz — Mon, 21 Oct 2019 20:40:04 GMT

You apparently want to drop a line if it is a complete duplicate of the prior line. If so, then make a temporary data set with the duplicates removed (data set view NEED2 below), and then PRINT:

data need1 / view=need1;
  set have;
  retain _sentinel .;
run;
data need2 (drop=_sentinel:) /view=need2;
  set need1;
  by _all_;
  if first._sentinel;
run;

proc print data=need2;
run;

Data set NEED1 is nothing but your original dataset with new variable _SENTINEL set to the right of all the other variables.

Data set NEED2 reads NEED1, using the BY _ALL_ NOTSORTED statement. The consequence is that the "if first._sentinel" subsetting keeps only records in which _SENTINEL changes or any variable to its left changes. The result is a data set that you want.

Re: How do I remove duplicates from a Proc Print Table after certain variables?

ChrisH10 — Mon, 21 Oct 2019 22:22:53 GMT

Thanks for your reply!

So in your code, if the main data set is titled "work.data", would I be replacing the "have" in data need1 with "work.data"? or do I replace need1 with work.data?

data need1 / view=need1;
  set work.data;
  retain _sentinel .;
run;

data need2 (drop=_sentinel:) /view=need2;
  set need1;
  by _all_;
  if first._sentinel;
run;

proc print data=need2;
run;

or does it need to look like this?

data work.data / view=work.data;
  set have;
  retain _sentinel .;
run;

data need2 (drop=_sentinel:) /view=need2;
  set work.data;
  by _all_;
  if first._sentinel;
run;

proc print data=need2;
run;

Re: How do I remove duplicates from a Proc Print Table after certain variables?

mkeintz — Mon, 21 Oct 2019 22:54:18 GMT

@ChrisH10 wrote:

Thanks for your reply!

So in your code, if the main data set is titled "work.data", would I be replacing the "have" in data need1 with "work.data"? or do I replace need1 with work.data?

I have a two-part answer:

Give it a try and see what happens.
Yes, my dataset HAVE is an allegory for your WORK.DATA.

Re: How do I remove duplicates from a Proc Print Table after certain variables?

ChrisH10 — Mon, 21 Oct 2019 23:21:26 GMT

Great! Thanks for your help. I made a new data set called Work.Delinquentloans and used your code to narrow most of it down:

data WORK.DELINQUENTLOANS;
	set work.data;
	where PERIOD_END_LSTAT = "In Grace Period" 
	or PERIOD_END_LSTAT = "Late (31-120 days)" 
	or PERIOD_END_LSTAT = "Late (16-30 days)";
run;


data need1 / view=need1;
  set WORK.DELINQUENTLOANS;
  retain _sentinel .;
run;

data need2 (drop=_sentinel:) /view=need2;
  set need1;
  by _all_ notsorted;
  if first._sentinel;
run;

proc print data=need2;
run;

The last thing I realized running the new code is that there are a couple variables (INT_PAID, PRINCP_PAID and RECIEVED_AMT) that have values that need to be ignored for searching for duplicates. As you can see in Obs 3,4 & 5, it's the same loan but those specific variables are making it think it's three different loans. Any ideas on how to ignore some variables? Thanks again, it's much appreciated!

Re: How do I remove duplicates from a Proc Print Table after certain variables?

ballardw — Tue, 22 Oct 2019 15:19:46 GMT

@ChrisH10 wrote:

Great! Thanks for your help. I made a new data set called Work.Delinquentloans and used your code to narrow most of it down:
data WORK.DELINQUENTLOANS;
	set work.data;
	where PERIOD_END_LSTAT = "In Grace Period" 
	or PERIOD_END_LSTAT = "Late (31-120 days)" 
	or PERIOD_END_LSTAT = "Late (16-30 days)";
run;


data need1 / view=need1;
  set WORK.DELINQUENTLOANS;
  retain _sentinel .;
run;

data need2 (drop=_sentinel:) /view=need2;
  set need1;
  by _all_ notsorted;
  if first._sentinel;
run;

proc print data=need2;
run;
The last thing I realized running the new code is that there are a couple variables (INT_PAID, PRINCP_PAID and RECIEVED_AMT) that have values that need to be ignored for searching for duplicates. As you can see in Obs 3,4 & 5, it's the same loan but those specific variables are making it think it's three different loans. Any ideas on how to ignore some variables? Thanks again, it's much appreciated!

Pick a specific list of variable names that you do need to determine "loan". Replace the _all_ with that list in this data step.

data need2 (drop=_sentinel:) /view=need2;
  set need1;
  by _all_ notsorted;
  if first._sentinel;
run;

Re: How do I remove duplicates from a Proc Print Table after certain variables?

mkeintz — Wed, 23 Oct 2019 14:19:27 GMT

So you now are telling us that there are a few variables that you want to ignore when detecting duplicate records. If there aren't too many variable to consider in the comparison of consecutive records, then list them in the DATA NEED2 step as @ballardw suggests. Or better yet, skip the DATA NEED2 step entirely and modify the DATA NEED1 step. Let's say you only need to compare 4 varaibles (A,B,C and D). Then you could:

data WORK.DELINQUENTLOANS;
	set work.data;
	where PERIOD_END_LSTAT = "In Grace Period" 
	or PERIOD_END_LSTAT = "Late (31-120 days)" 
	or PERIOD_END_LSTAT = "Late (16-30 days)";
run;


data need1 / view=need1;
  set WORK.DELINQUENTLOANS;
  by A B C D notsorted;
  if first.D;
run;


proc print data=need1;
run;

But if you have a lot of variables to compare, and only 3 to ignore (then I would keep the "BY _ALL_" but control what _ALL_ actually represents, as in

data WORK.DELINQUENTLOANS;
	set work.data;
	where PERIOD_END_LSTAT = "In Grace Period" 
	or PERIOD_END_LSTAT = "Late (31-120 days)" 
	or PERIOD_END_LSTAT = "Late (16-30 days)";
run;


data need1 / view=need1;
  retain _sentinel1 .;
  if 0 then set work.delinquentloans (drop=INT_PAID, PRINCP_PAID RECIEVED_AMT);
  retain _sentinel2 .;
  set work.delinquentloans;
run;

 
data need2 (drop=_sentinel:) /view=need2;
  set need1;
  by _sentinel1--_sentinel2 notsorted;
  if first._sentinel2;
run;

proc print data=need2;
run;

What's different here?:

DATA NEED1 has variables in this order, from left to right:
1. _sentinel1
2. all the original variables except INT_PAID, PRINCP_PAID and RECIEVED_AMT
3. _sentinel2
4. INT_PAID, PRINCP_PAID and RECIEVED_AMT
The reason DATA NEED1 has this is due the the "IF 0 then SET work.delinquentloans (drop=...)" which forces the SAS compiler to arrange the program-data-vector to make accommodations for all the variables except INT_PAID, PRINCP_PAID and RECIEVED_AMT (because of the "drop=" option). But because it is "IF 0 THEN ..." the execution never happens. Then the compiler encounters the RETAIN _SENTINEL2 statement. Finally, the data is only actually read by the subsequent SET statement, where the 3 orphan variables are discovered by the compiler.
The result is that the DATA NEED2 statement BY _SENTINEL1 -- _SENTINEL2 tells SAS to compare all the variables between _SENTINEL1 and _SENTINEL2 inclusive.
1. Therefore the FIRST._SENTINEL2 dummy var tells you when _SENTINEL2 or any variable to its left has changed. I.e. it ignores the variables to the right of _SENTINEL2 ((INT_PAID, PRINCP_PAID and RECIEVED_AMT).

Edit change: corrected RETAIN SENTINEL2 to RETAIN _SENTINEL2.

Re: How do I remove duplicates from a Proc Print Table after certain variables?

ChrisH10 — Wed, 23 Oct 2019 03:19:48 GMT

Thanks! I run into the error with the code (when trying your second piece of code), saying that:

ERROR: _sentinel2 does not follow _sentinel1 on the list of previously defined variables.

I'd assume I would change _sentinel2 to just sentinel2 right? to this:

data need2 (drop=_sentinel:) /view=need2;
  set need1;
  by _sentinel1--sentinel2 notsorted;
  if first.sentinel2;
run;

Also, if I attempt to run the above code, it just creates a NEED2 with no filtered data. What should I do to fix this?

Re: How do I remove duplicates from a Proc Print Table after certain variables?

mkeintz — Wed, 23 Oct 2019 14:28:27 GMT

You correctly identified my error (now fixed in the relevant post). And yes, you suggesting of changing the spelling of _SENTINEL2 to SENTINEL2 will work. But I'd recommend the opposite (change SENTINEL2 to _SENTINEL2). That's because the statement

data need2 (drop=_sentinel:) / view=need2;

will unnecessarily allow SENTINEL2 to remain in data set NEED2. The "drop=_sentinel:" option removes all variables whose name begins with _SENTINEL (i.e. both _SENTINEL1 and _SENTINEL2). I use the option it to eliminate no-longer-useful information in the desired data set.