BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
curiosity
Obsidian | Level 7

Hi, 

 

I have the following sample data:

 

 

data have;
	input CompanyID	$ InvoiceAmount UserID $;
datalines;
A1	50 C1
A1	10	C7
A1	 5 	C3
B5	 100000 C3
B5	 10000 	C7
B5	 5000 	C2
B5	 1 	C1
D2	 100 	C2
D2	 50 	C2
D5	 1000 	C3
D5	 100 	C4
;
run;

 

Here is the dataset that I want:

data want;
	input CompanyID	$ InvoiceAmount UserID $;
datalines;
A1	 50 C1
B5	 5000 	C2
D2	 100 C2
D5	 1000 C3
;
run;

 

I have multiple companyID in my have dataset but i need just one entry of each companyID in my want dataset with the following order of preferences on which to choose.

  1. Want should only have one row for each companyID
  2. Want should take observation on each companyID that has userID of C1 or C2, if available.
    If there are multiple C1 or C2 for a CompanyID, then take the observation that has the highest invoiceAmount of either C1 or C2.
  3. If a specific companyID does not have userID of C1 or C2, then it should choose the observation that has the highest InvoiceAmount.

The data is already sorted on invoiceAmount by companyID.

 

Note: in the actual dataset theres about 10 million observations and each company might have 10 userIDs and C1 or C2 may be present in all 10 or in none of them.

 

Appreciate any help 🙂 

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

Sorting and judicious use of the "WHERE=" data set name parameter appears to be the simplest:

 

data have;
 input CompanyID $ InvoiceAmount UserID $;
datalines;
A1 50 C1
A1 10 C7
A1  5  C3
B5  100000 C3
B5  10000  C7
B5  5000  C2
B5  1  C1
D2  100  C2
D2  50  C2
D5  1000  C3
D5  100  C4
;

proc sort data=have out=need;
  by companyid descending invoiceamount;
run;

data want;
  set need (where=(userid in ('C1','C2'))      )
      need (where=(not(userid in ('C1','C2'))) );
  by companyid;
  if first.companyid;
run;

The where= params force all the "C1" or "C2" obs to precede all the others for a given companyid.  And since the data are sorted by descending invoice, the where= means the largest C1/C2 invoice will be first, even if there are larger non C1/C2 observations.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

12 REPLIES 12
Patrick
Opal | Level 21

You could use code as below.

data have;
 input CompanyID $ InvoiceAmount UserID $;
datalines;
A1 50 C1
A1 10 C7
A1  5  C3
B5  100000 C3
B5  10000  C7
B5  5000  C2
B5  1  C1
D2  100  C2
D2  50  C2
D5  1000  C3
D5  100  C4
;

proc sql;
  create view v_inter as
  select *
  from have
  order by 
    CompanyID
    ,case when UserID in ('C1','C2') then 1 else 2 end
    ,InvoiceAmount DESC
  ;
quit;

data want;
  set v_inter;
  by CompanyID;
  if first.CompanyID;
run;

proc print data=want;
run;

Patrick_0-1707175039643.png

 

Above code logic needs potentially some amendment for performance in case the data volumes in source table HAVE are really high or the table resides in a database. Please let us know if that's the case and provide further detail (data volumes/number of rows, database type and version).

 

 

curiosity
Obsidian | Level 7

Thank you for your reply.

 

Would you happen to know how to solve this using data steps intead of procSQL or is it that its more efficient using procsql? I dont have a lot of experience with procsql.. but i'll use it if its more efficient.

 

The dataset is about 10 million observations currently stored in a standard SAS 9.4 dataset.

Patrick
Opal | Level 21

@curiosity wrote:

Thank you for your reply.

 

Would you happen to know how to solve this using data steps intead of procSQL or is it that its more efficient using procsql? I dont have a lot of experience with procsql.. but i'll use it if its more efficient.

 

The dataset is about 10 million observations currently stored in a standard SAS 9.4 dataset.


SQL for the case here is more efficient from a coding/line of code perspective but it does internal sorting and sorting normally requires quite a bit of resources.

Patrick
Opal | Level 21

Below code uses a data step and takes advantage of the known sort order of your source data.

Because there is no need for additional sorting this code version should outperform any other approach that requires sorting.

In the code below C1 customers get priority over C2 customers. If you don't care then just amend the code and combine the logic for the two to only populate and use a variable _point_c

data have;
 input CompanyID $ InvoiceAmount UserID $;
datalines;
A1 50 C1
A1 10 C7
A1  5  C3
B5  100000 C3
B5  10000  C7
B5  5000  C2
B5  1  C1
D2  100  C2
D2  50  C2
D5  1000  C3
D5  100  C4
;

/* The data is already sorted on invoiceAmount by companyID. */
proc sort data=have;
  by CompanyID InvoiceAmount;
run;

data want(drop=_:);
  set have;
  by CompanyID InvoiceAmount;
  retain _point_c1 _point_c2;

  if UserID in ('C1') then _point_c1 = _n_;
  else 
  if UserID in ('C2') then _point_c2 = _n_;

  if last.CompanyID then
    do;
      _point=coalesce(_point_c1,_point_c2,_n_);
      set have point=_point;
      output;
      call missing(of _point_c:);
    end;
run;

proc print data=want;
run;

 

If each companyID has say 10+ userIDs and C1 or C2 may be present in all 10 or in none of them, would that affect the code?

The only thing that affects the code is that we need to treat C1 & C2 customers differently to the rest by giving them priority.

curiosity
Obsidian | Level 7

I meant the want should be:

data want;
	input CompanyID	$ InvoiceAmount UserID $;
datalines;
A1	 50 C1
B5	 5000 	C2
D2	 100 C2
D5	 1000 C3
;
run;

 

But I see Patrick already knew what I meant! 


In my original version I had the condition that it should choose first the highest between C1 then if no C1 is present then to choose highest C2. Then if no (C1 or C2) then choose the highest invoiceAmount of any of the remainder userIDs.   But for my requirement I realized I'm okay with the highest observation of C1 or C2 if either is present.

 

 

Patrick
Opal | Level 21

In my original version I had the condition that it should choose first the highest between C1 then if no C1 is present then to choose highest C2. Then if no (C1 or C2) then choose the highest invoiceAmount of any of the remainder userIDs.   But for my requirement I realized I'm okay with the highest observation of C1 or C2 if either is present.


If you want to revert back to your original selection logic then just change the CASE logic

    ,case when UserID ='C1' then 1 when UserID ='C2' then 1 else 3 end
curiosity
Obsidian | Level 7
If each companyID has say 10+ userIDs and C1 or C2 may be present in all 10 or in none of them, would that affect the code?
Ksharp
Super User
data have;
	input CompanyID	$ InvoiceAmount UserID $;
datalines;
A1	50 C1
A1	10	C7
A1	 5 	C3
B5	 100000 C3
B5	 10000 	C7
B5	 5000 	C2
B5	 1 	C1
D2	 100 	C2
D2	 50 	C2
D5	 1000 	C3
D5	 100 	C4
;
run;

data temp;
 set have;
 flag=not (UserID in ('C1' 'C2'));
run;
proc sort data=temp out=temp1 ;
by CompanyID flag descending InvoiceAmount;
run;
proc sort data=temp1 out=temp2 nodupkey;
by CompanyID flag ;
run;
proc sort data=temp2 out=want nodupkey;
by CompanyID  ;
run;
mkeintz
PROC Star

Sorting and judicious use of the "WHERE=" data set name parameter appears to be the simplest:

 

data have;
 input CompanyID $ InvoiceAmount UserID $;
datalines;
A1 50 C1
A1 10 C7
A1  5  C3
B5  100000 C3
B5  10000  C7
B5  5000  C2
B5  1  C1
D2  100  C2
D2  50  C2
D5  1000  C3
D5  100  C4
;

proc sort data=have out=need;
  by companyid descending invoiceamount;
run;

data want;
  set need (where=(userid in ('C1','C2'))      )
      need (where=(not(userid in ('C1','C2'))) );
  by companyid;
  if first.companyid;
run;

The where= params force all the "C1" or "C2" obs to precede all the others for a given companyid.  And since the data are sorted by descending invoice, the where= means the largest C1/C2 invoice will be first, even if there are larger non C1/C2 observations.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

If sorting by companyid/descending invoiceamount is expensive, or you just want to avaid a proc sort, you can keep track of the max value for each id, reserving a variable for C1/C2 obs and for all obs.  At the end of the id, just re-read and output the obs that has the maximum qualifying invoice.

 

Of course, this assumes that the data are sorted by companyid:

 

data want (drop=_:);
  set have;
  by companyid;
  retain _max1  _n_max1  /*Track C!/C2 obs only*/
         _max2  _n_max2  /*Track all obs */;
  if first.companyid then call missing (of _:);
  if userid in ('C1','C2') and invoiceamount > _max1 then do;
     _max1=invoiceamount;
     _n_max1=_n_;
  end;
  if invoiceamount > _max2 then do;
     _max2=invoiceamount;
     _n_max2=_n_;
  end;
  if last.companyid;
  _n_max=coalesce(_n_max1,_n_max2);
  set have point=_n_max;
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
s_lassen
Meteorite | Level 14

As your data is already sorted by CompanyID and descending InvoiceAmount, this should do it:

data want;
  set 
    have(where=(UserID in('C1','C2')))
    have(where=(UserID not in('C1','C2')))
    ;
  by CompanyID;
  if first.CompanyID;
run;
curiosity
Obsidian | Level 7

Appreciate all the suggestions!

 

I tested them all and they of course all worked.

 

I used the last one since it was the simplest for me to understand.

 

Thank you all 🙂

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 992 views
  • 11 likes
  • 5 in conversation