Hi everyone,
So I have a 9000 line data set (work.data) that displays loans and their associated values with them. One of the values: PERIOD_END_LSTAT tells me if the payments are overdue ("In Grace Period" or "Late (31-120 days)" or "Late (16-30 days)").
Since the payments that occur more than once for the same loan, there are multiple rows of late payments (duplicates). As you can see in the screenshot, the observation 339 & 340 are the same, but I just want it to show up once.
Below, I've sorted all of the late payments from the data set, but I just want to know how many unique loans have late payments:
Proc Print Data=work.data; where PERIOD_END_LSTAT = "In Grace Period" or PERIOD_END_LSTAT = "Late (31-120 days)" or PERIOD_END_LSTAT = "Late (16-30 days)"; Run;
How can I remove the same loans that appear more than once? I just want it to show up once so I can count the unique loans.
Thanks!
You apparently want to drop a line if it is a complete duplicate of the prior line. If so, then make a temporary data set with the duplicates removed (data set view NEED2 below), and then PRINT:
data need1 / view=need1;
  set have;
  retain _sentinel .;
run;
data need2 (drop=_sentinel:) /view=need2;
  set need1;
  by _all_;
  if first._sentinel;
run;
proc print data=need2;
run;Data set NEED1 is nothing but your original dataset with new variable _SENTINEL set to the right of all the other variables.
Data set NEED2 reads NEED1, using the BY _ALL_ NOTSORTED statement. The consequence is that the "if first._sentinel" subsetting keeps only records in which _SENTINEL changes or any variable to its left changes. The result is a data set that you want.
@ChrisH10 wrote:
Hi everyone,
So I have a 9000 line data set (work.data) that displays loans and their associated values with them. One of the values: PERIOD_END_LSTAT tells me if the payments are overdue ("In Grace Period" or "Late (31-120 days)" or "Late (16-30 days)").
Since the payments that occur more than once for the same loan, there are multiple rows of late payments (duplicates). As you can see in the screenshot, the observation 339 & 340 are the same, but I just want it to show up once.
Below, I've sorted all of the late payments from the data set, but I just want to know how many unique loans have late payments:
Proc Print Data=work.data; where PERIOD_END_LSTAT = "In Grace Period" or PERIOD_END_LSTAT = "Late (31-120 days)" or PERIOD_END_LSTAT = "Late (16-30 days)"; Run;
How can I remove the same loans that appear more than once? I just want it to show up once so I can count the unique loans.
Thanks! I'm very very new to SAS but need to learn this for a job offer.
Which record do you want to display? First? Last? in the middle of a series? And if a load goes from "Late (16-30 days" to "Late (31-120 days)" what do you want to show in the output?
You may also have to provide the minimum list of variables needed to uniquely identify a "loan".
Likely you will need a data step to filter the records depending on your answers to the above questions.
Well my main goal right now is to just make the table removed of all the duplicates where all of the columns are the same to the one above it (as in obs 339 and 340). I would want it to display all of the columns as shown in the screenshot. I mean the output would just need to be the removed duplicates table and a count of how many loans there are.
The minimum list of variables I have decided to identify a loan are ID, IssuedDate and loan_amnt (not shown on screenshot):
proc sort data=DATA out=SAS_Sorted_DATA nodupkey; by id loan_amnt IssuedDate; run;
Hopefully that helps!
You apparently want to drop a line if it is a complete duplicate of the prior line. If so, then make a temporary data set with the duplicates removed (data set view NEED2 below), and then PRINT:
data need1 / view=need1;
  set have;
  retain _sentinel .;
run;
data need2 (drop=_sentinel:) /view=need2;
  set need1;
  by _all_;
  if first._sentinel;
run;
proc print data=need2;
run;Data set NEED1 is nothing but your original dataset with new variable _SENTINEL set to the right of all the other variables.
Data set NEED2 reads NEED1, using the BY _ALL_ NOTSORTED statement. The consequence is that the "if first._sentinel" subsetting keeps only records in which _SENTINEL changes or any variable to its left changes. The result is a data set that you want.
Thanks for your reply!
So in your code, if the main data set is titled "work.data", would I be replacing the "have" in data need1 with "work.data"? or do I replace need1 with work.data?
data need1 / view=need1; set work.data; retain _sentinel .; run; data need2 (drop=_sentinel:) /view=need2; set need1; by _all_; if first._sentinel; run; proc print data=need2; run;
or does it need to look like this?
data work.data / view=work.data; set have; retain _sentinel .; run; data need2 (drop=_sentinel:) /view=need2; set work.data; by _all_; if first._sentinel; run; proc print data=need2; run;
@ChrisH10 wrote:
Thanks for your reply!
So in your code, if the main data set is titled "work.data", would I be replacing the "have" in data need1 with "work.data"? or do I replace need1 with work.data?
I have a two-part answer:
Great! Thanks for your help. I made a new data set called Work.Delinquentloans and used your code to narrow most of it down:
data WORK.DELINQUENTLOANS; set work.data; where PERIOD_END_LSTAT = "In Grace Period" or PERIOD_END_LSTAT = "Late (31-120 days)" or PERIOD_END_LSTAT = "Late (16-30 days)"; run; data need1 / view=need1; set WORK.DELINQUENTLOANS; retain _sentinel .; run; data need2 (drop=_sentinel:) /view=need2; set need1; by _all_ notsorted; if first._sentinel; run; proc print data=need2; run;
The last thing I realized running the new code is that there are a couple variables (INT_PAID, PRINCP_PAID and RECIEVED_AMT) that have values that need to be ignored for searching for duplicates. As you can see in Obs 3,4 & 5, it's the same loan but those specific variables are making it think it's three different loans. Any ideas on how to ignore some variables? Thanks again, it's much appreciated!
@ChrisH10 wrote:
Great! Thanks for your help. I made a new data set called Work.Delinquentloans and used your code to narrow most of it down:
data WORK.DELINQUENTLOANS; set work.data; where PERIOD_END_LSTAT = "In Grace Period" or PERIOD_END_LSTAT = "Late (31-120 days)" or PERIOD_END_LSTAT = "Late (16-30 days)"; run; data need1 / view=need1; set WORK.DELINQUENTLOANS; retain _sentinel .; run; data need2 (drop=_sentinel:) /view=need2; set need1; by _all_ notsorted; if first._sentinel; run; proc print data=need2; run;The last thing I realized running the new code is that there are a couple variables (INT_PAID, PRINCP_PAID and RECIEVED_AMT) that have values that need to be ignored for searching for duplicates. As you can see in Obs 3,4 & 5, it's the same loan but those specific variables are making it think it's three different loans. Any ideas on how to ignore some variables? Thanks again, it's much appreciated!
Pick a specific list of variable names that you do need to determine "loan". Replace the _all_ with that list in this data step.
data need2 (drop=_sentinel:) /view=need2; set need1; by _all_ notsorted; if first._sentinel; run;
So you now are telling us that there are a few variables that you want to ignore when detecting duplicate records. If there aren't too many variable to consider in the comparison of consecutive records, then list them in the DATA NEED2 step as @ballardw suggests. Or better yet, skip the DATA NEED2 step entirely and modify the DATA NEED1 step. Let's say you only need to compare 4 varaibles (A,B,C and D). Then you could:
data WORK.DELINQUENTLOANS;
	set work.data;
	where PERIOD_END_LSTAT = "In Grace Period" 
	or PERIOD_END_LSTAT = "Late (31-120 days)" 
	or PERIOD_END_LSTAT = "Late (16-30 days)";
run;
data need1 / view=need1;
  set WORK.DELINQUENTLOANS;
  by A B C D notsorted;
  if first.D;
run;
proc print data=need1;
run;But if you have a lot of variables to compare, and only 3 to ignore (then I would keep the "BY _ALL_" but control what _ALL_ actually represents, as in
data WORK.DELINQUENTLOANS;
	set work.data;
	where PERIOD_END_LSTAT = "In Grace Period" 
	or PERIOD_END_LSTAT = "Late (31-120 days)" 
	or PERIOD_END_LSTAT = "Late (16-30 days)";
run;
data need1 / view=need1;
  retain _sentinel1 .;
  if 0 then set work.delinquentloans (drop=INT_PAID, PRINCP_PAID RECIEVED_AMT);
  retain _sentinel2 .;
  set work.delinquentloans;
run;
 
data need2 (drop=_sentinel:) /view=need2;
  set need1;
  by _sentinel1--_sentinel2 notsorted;
  if first._sentinel2;
run;
proc print data=need2;
run;What's different here?:
Edit change: corrected RETAIN SENTINEL2 to RETAIN _SENTINEL2.
Thanks! I run into the error with the code (when trying your second piece of code), saying that:
ERROR: _sentinel2 does not follow _sentinel1 on the list of previously defined variables.
I'd assume I would change _sentinel2 to just sentinel2 right? to this:
data need2 (drop=_sentinel:) /view=need2; set need1; by _sentinel1--sentinel2 notsorted; if first.sentinel2; run;
Also, if I attempt to run the above code, it just creates a NEED2 with no filtered data. What should I do to fix this?
You correctly identified my error (now fixed in the relevant post). And yes, you suggesting of changing the spelling of _SENTINEL2 to SENTINEL2 will work. But I'd recommend the opposite (change SENTINEL2 to _SENTINEL2). That's because the statement
data need2 (drop=_sentinel:) / view=need2;
will unnecessarily allow SENTINEL2 to remain in data set NEED2. The "drop=_sentinel:" option removes all variables whose name begins with _SENTINEL (i.e. both _SENTINEL1 and _SENTINEL2). I use the option it to eliminate no-longer-useful information in the desired data set.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.
