topic Re: Filter observations based on two variables in SAS Programming

Filter observations based on two variables

curiosity — Mon, 05 Feb 2024 23:34:18 GMT

Hi,

I have the following sample data:

data have;
	input CompanyID	$ InvoiceAmount UserID $;
datalines;
A1	50 C1
A1	10	C7
A1	 5 	C3
B5	 100000 C3
B5	 10000 	C7
B5	 5000 	C2
B5	 1 	C1
D2	 100 	C2
D2	 50 	C2
D5	 1000 	C3
D5	 100 	C4
;
run;

Here is the dataset that I want:

data want;
	input CompanyID	$ InvoiceAmount UserID $;
datalines;
A1	 50 C1
B5	 5000 	C2
D2	 100 C2
D5	 1000 C3
;
run;

I have multiple companyID in my have dataset but i need just one entry of each companyID in my want dataset with the following order of preferences on which to choose.

Want should only have one row for each companyID
Want should take observation on each companyID that has userID of C1 or C2, if available.
If there are multiple C1 or C2 for a CompanyID, then take the observation that has the highest invoiceAmount of either C1 or C2.
If a specific companyID does not have userID of C1 or C2, then it should choose the observation that has the highest InvoiceAmount.

The data is already sorted on invoiceAmount by companyID.

Note: in the actual dataset theres about 10 million observations and each company might have 10 userIDs and C1 or C2 may be present in all 10 or in none of them.

Appreciate any help 🙂

Re: Filter observations based on two variables

Patrick — Mon, 05 Feb 2024 23:20:17 GMT

You could use code as below.

data have;
 input CompanyID $ InvoiceAmount UserID $;
datalines;
A1 50 C1
A1 10 C7
A1  5  C3
B5  100000 C3
B5  10000  C7
B5  5000  C2
B5  1  C1
D2  100  C2
D2  50  C2
D5  1000  C3
D5  100  C4
;

proc sql;
  create view v_inter as
  select *
  from have
  order by 
    CompanyID
    ,case when UserID in ('C1','C2') then 1 else 2 end
    ,InvoiceAmount DESC
  ;
quit;

data want;
  set v_inter;
  by CompanyID;
  if first.CompanyID;
run;

proc print data=want;
run;

Above code logic needs potentially some amendment for performance in case the data volumes in source table HAVE are really high or the table resides in a database. Please let us know if that's the case and provide further detail (data volumes/number of rows, database type and version).

Re: Filter observations based on two variables

curiosity — Mon, 05 Feb 2024 23:28:06 GMT

I meant the want should be:

data want;
	input CompanyID	$ InvoiceAmount UserID $;
datalines;
A1	 50 C1
B5	 5000 	C2
D2	 100 C2
D5	 1000 C3
;
run;

But I see Patrick already knew what I meant!

In my original version I had the condition that it should choose first the highest between C1 then if no C1 is present then to choose highest C2. Then if no (C1 or C2) then choose the highest invoiceAmount of any of the remainder userIDs. But for my requirement I realized I'm okay with the highest observation of C1 or C2 if either is present.

Re: Filter observations based on two variables

curiosity — Mon, 05 Feb 2024 23:31:34 GMT

Thank you for your reply.

Would you happen to know how to solve this using data steps intead of procSQL or is it that its more efficient using procsql? I dont have a lot of experience with procsql.. but i'll use it if its more efficient.

The dataset is about 10 million observations currently stored in a standard SAS 9.4 dataset.

Re: Filter observations based on two variables

Patrick — Mon, 05 Feb 2024 23:32:16 GMT

In my original version I had the condition that it should choose first the highest between C1 then if no C1 is present then to choose highest C2. Then if no (C1 or C2) then choose the highest invoiceAmount of any of the remainder userIDs. But for my requirement I realized I'm okay with the highest observation of C1 or C2 if either is present.

If you want to revert back to your original selection logic then just change the CASE logic

    ,case when UserID ='C1' then 1 when UserID ='C2' then 1 else 3 end

Re: Filter observations based on two variables

Patrick — Mon, 05 Feb 2024 23:34:48 GMT

@curiosity wrote:

Thank you for your reply.

Would you happen to know how to solve this using data steps intead of procSQL or is it that its more efficient using procsql? I dont have a lot of experience with procsql.. but i'll use it if its more efficient.

The dataset is about 10 million observations currently stored in a standard SAS 9.4 dataset.

SQL for the case here is more efficient from a coding/line of code perspective but it does internal sorting and sorting normally requires quite a bit of resources.

Re: Filter observations based on two variables

curiosity — Mon, 05 Feb 2024 23:42:44 GMT

If each companyID has say 10+ userIDs and C1 or C2 may be present in all 10 or in none of them, would that affect the code?

Re: Filter observations based on two variables

Patrick — Tue, 06 Feb 2024 09:44:31 GMT

Below code uses a data step and takes advantage of the known sort order of your source data.

Because there is no need for additional sorting this code version should outperform any other approach that requires sorting.

In the code below C1 customers get priority over C2 customers. If you don't care then just amend the code and combine the logic for the two to only populate and use a variable _point_c.

data have;
 input CompanyID $ InvoiceAmount UserID $;
datalines;
A1 50 C1
A1 10 C7
A1  5  C3
B5  100000 C3
B5  10000  C7
B5  5000  C2
B5  1  C1
D2  100  C2
D2  50  C2
D5  1000  C3
D5  100  C4
;

/* The data is already sorted on invoiceAmount by companyID. */
proc sort data=have;
  by CompanyID InvoiceAmount;
run;

data want(drop=_:);
  set have;
  by CompanyID InvoiceAmount;
  retain _point_c1 _point_c2;

  if UserID in ('C1') then _point_c1 = _n_;
  else 
  if UserID in ('C2') then _point_c2 = _n_;

  if last.CompanyID then
    do;
      _point=coalesce(_point_c1,_point_c2,_n_);
      set have point=_point;
      output;
      call missing(of _point_c:);
    end;
run;

proc print data=want;
run;

If each companyID has say 10+ userIDs and C1 or C2 may be present in all 10 or in none of them, would that affect the code?

The only thing that affects the code is that we need to treat C1 & C2 customers differently to the rest by giving them priority.

Re: Filter observations based on two variables

Ksharp — Tue, 06 Feb 2024 06:48:21 GMT

data have;
	input CompanyID	$ InvoiceAmount UserID $;
datalines;
A1	50 C1
A1	10	C7
A1	 5 	C3
B5	 100000 C3
B5	 10000 	C7
B5	 5000 	C2
B5	 1 	C1
D2	 100 	C2
D2	 50 	C2
D5	 1000 	C3
D5	 100 	C4
;
run;

data temp;
 set have;
 flag=not (UserID in ('C1' 'C2'));
run;
proc sort data=temp out=temp1 ;
by CompanyID flag descending InvoiceAmount;
run;
proc sort data=temp1 out=temp2 nodupkey;
by CompanyID flag ;
run;
proc sort data=temp2 out=want nodupkey;
by CompanyID  ;
run;

Re: Filter observations based on two variables

mkeintz — Tue, 06 Feb 2024 17:24:02 GMT

Sorting and judicious use of the "WHERE=" data set name parameter appears to be the simplest:

data have;
 input CompanyID $ InvoiceAmount UserID $;
datalines;
A1 50 C1
A1 10 C7
A1  5  C3
B5  100000 C3
B5  10000  C7
B5  5000  C2
B5  1  C1
D2  100  C2
D2  50  C2
D5  1000  C3
D5  100  C4
;

proc sort data=have out=need;
  by companyid descending invoiceamount;
run;

data want;
  set need (where=(userid in ('C1','C2'))      )
      need (where=(not(userid in ('C1','C2'))) );
  by companyid;
  if first.companyid;
run;

The where= params force all the "C1" or "C2" obs to precede all the others for a given companyid. And since the data are sorted by descending invoice, the where= means the largest C1/C2 invoice will be first, even if there are larger non C1/C2 observations.

Re: Filter observations based on two variables

mkeintz — Tue, 06 Feb 2024 18:24:58 GMT

If sorting by companyid/descending invoiceamount is expensive, or you just want to avaid a proc sort, you can keep track of the max value for each id, reserving a variable for C1/C2 obs and for all obs. At the end of the id, just re-read and output the obs that has the maximum qualifying invoice.

Of course, this assumes that the data are sorted by companyid:

data want (drop=_:);
  set have;
  by companyid;
  retain _max1  _n_max1  /*Track C!/C2 obs only*/
         _max2  _n_max2  /*Track all obs */;
  if first.companyid then call missing (of _:);
  if userid in ('C1','C2') and invoiceamount > _max1 then do;
     _max1=invoiceamount;
     _n_max1=_n_;
  end;
  if invoiceamount > _max2 then do;
     _max2=invoiceamount;
     _n_max2=_n_;
  end;
  if last.companyid;
  _n_max=coalesce(_n_max1,_n_max2);
  set have point=_n_max;
run;

Re: Filter observations based on two variables

s_lassen — Wed, 07 Feb 2024 06:46:44 GMT

As your data is already sorted by CompanyID and descending InvoiceAmount, this should do it:

data want;
  set 
    have(where=(UserID in('C1','C2')))
    have(where=(UserID not in('C1','C2')))
    ;
  by CompanyID;
  if first.CompanyID;
run;

Re: Filter observations based on two variables

curiosity — Thu, 08 Feb 2024 23:35:00 GMT

Appreciate all the suggestions!

I tested them all and they of course all worked.

I used the last one since it was the simplest for me to understand.

Thank you all 🙂