topic Re: De-duplicate value from column based on highest value from another column in SAS Programming

De-duplicate value from column based on highest value from another column

Justin9 — Sun, 27 Feb 2022 22:04:56 GMT

Hi, I've got a dataset (called 'RC.DATA_CHECK') that has two columns: Quarter and Rate. Please can someone provide me with the required code that would only keep one version of the Quarter value, based on the highest value of Rate? Note that 2021Q1 has three records, two of which are the same, but I would just like to keep one of these rows for this quarter. I have included a screenshot below of what I would like my dataset to look like after the dataset has been de-duplicated.

Re: De-duplicate value from column based on highest value from another column

PaigeMiller — Mon, 28 Feb 2022 11:16:33 GMT

Find the maximum value of rate, by quarter:

proc summary data=have nway;
    class quarter;
    var rate;
    output out=want max=;
run;

Re: De-duplicate value from column based on highest value from another column

Justin9 — Sun, 27 Feb 2022 22:22:38 GMT

Thanks for your reply. I've just realised that I've got another dataset that has a similar problem to what I posted, only that it has about an extra 100 columns along with the two (quarter and rate). If I still wanted to do the same thing (keep one record of quarter based on the highest value of rate, how would the code change (i.e. I want to have all of the c.100 columns in my dataset, with the only change being that only one unique quarter (based on highest rate) is shown with all of the information from the other c.100 columns for this row)?

Re: De-duplicate value from column based on highest value from another column

mkeintz — Sun, 27 Feb 2022 22:41:03 GMT

Your data appear to be sorted by quarter. If so, then:

data want (drop=max_rate);
  do until (last.quarter);
    set have;
    by quarter;
    max_rate=max(max_rate,rate);
  end;
  do until (last.quarter);
    set have;
    by quarter;
    if rate=max_rate then output;
  end;
run;

As long as no quarter has multiple observations with the maximum rate, this will produce what you want.

It works by reading all obs for each quarter twice. The first time to determine the maximum rate. The second time to re-read and output the obs with that maximum.

Re: De-duplicate value from column based on highest value from another column

Justin9 — Sun, 27 Feb 2022 22:46:06 GMT

Thanks for your post. My example had 2021Q1 with two observations with the maximum rate (see original post), so does that impact on the code that you've just posted?

Re: De-duplicate value from column based on highest value from another column

Justin9 — Mon, 28 Feb 2022 01:47:06 GMT

Hi, can someone please tell me how the code would change if I wanted the same criteria (keep only one quarter based on the highest value of rate), but there are 100 variables/columns in the dataset (rather than just the two)?

Re: De-duplicate value from column based on highest value from another column

Tom — Mon, 28 Feb 2022 02:12:34 GMT

Sort the data and use FIRST. processing.

proc sort data=have out=sorted;
  by quarter descending rate;
run;

data want;
  set sorted;
  by quarter;
  if first.quarter;
run;

If there are variables you can use to help decide which one of the ties to pick add those to the BY statement in the proc sort step.

Re: De-duplicate value from column based on highest value from another column

mkeintz — Mon, 28 Feb 2022 02:43:46 GMT

@Justin9 wrote:

Thanks for your post. My example had 2021Q1 with two observations with the maximum rate (see original post), so does that impact on the code that you've just posted?

Then which record with the tied rates do you want?