Hi,
I have the following sample data:
data have; input CompanyID $ InvoiceAmount UserID $; datalines; A1 50 C1 A1 10 C7 A1 5 C3 B5 100000 C3 B5 10000 C7 B5 5000 C2 B5 1 C1 D2 100 C2 D2 50 C2 D5 1000 C3 D5 100 C4 ; run;
Here is the dataset that I want:
data want; input CompanyID $ InvoiceAmount UserID $; datalines; A1 50 C1 B5 5000 C2 D2 100 C2 D5 1000 C3 ; run;
I have multiple companyID in my have dataset but i need just one entry of each companyID in my want dataset with the following order of preferences on which to choose.
The data is already sorted on invoiceAmount by companyID.
Note: in the actual dataset theres about 10 million observations and each company might have 10 userIDs and C1 or C2 may be present in all 10 or in none of them.
Appreciate any help 🙂
Sorting and judicious use of the "WHERE=" data set name parameter appears to be the simplest:
data have;
input CompanyID $ InvoiceAmount UserID $;
datalines;
A1 50 C1
A1 10 C7
A1 5 C3
B5 100000 C3
B5 10000 C7
B5 5000 C2
B5 1 C1
D2 100 C2
D2 50 C2
D5 1000 C3
D5 100 C4
;
proc sort data=have out=need;
by companyid descending invoiceamount;
run;
data want;
set need (where=(userid in ('C1','C2')) )
need (where=(not(userid in ('C1','C2'))) );
by companyid;
if first.companyid;
run;
The where= params force all the "C1" or "C2" obs to precede all the others for a given companyid. And since the data are sorted by descending invoice, the where= means the largest C1/C2 invoice will be first, even if there are larger non C1/C2 observations.
You could use code as below.
data have;
input CompanyID $ InvoiceAmount UserID $;
datalines;
A1 50 C1
A1 10 C7
A1 5 C3
B5 100000 C3
B5 10000 C7
B5 5000 C2
B5 1 C1
D2 100 C2
D2 50 C2
D5 1000 C3
D5 100 C4
;
proc sql;
create view v_inter as
select *
from have
order by
CompanyID
,case when UserID in ('C1','C2') then 1 else 2 end
,InvoiceAmount DESC
;
quit;
data want;
set v_inter;
by CompanyID;
if first.CompanyID;
run;
proc print data=want;
run;
Above code logic needs potentially some amendment for performance in case the data volumes in source table HAVE are really high or the table resides in a database. Please let us know if that's the case and provide further detail (data volumes/number of rows, database type and version).
Thank you for your reply.
Would you happen to know how to solve this using data steps intead of procSQL or is it that its more efficient using procsql? I dont have a lot of experience with procsql.. but i'll use it if its more efficient.
The dataset is about 10 million observations currently stored in a standard SAS 9.4 dataset.
@curiosity wrote:
Thank you for your reply.
Would you happen to know how to solve this using data steps intead of procSQL or is it that its more efficient using procsql? I dont have a lot of experience with procsql.. but i'll use it if its more efficient.
The dataset is about 10 million observations currently stored in a standard SAS 9.4 dataset.
SQL for the case here is more efficient from a coding/line of code perspective but it does internal sorting and sorting normally requires quite a bit of resources.
Below code uses a data step and takes advantage of the known sort order of your source data.
Because there is no need for additional sorting this code version should outperform any other approach that requires sorting.
In the code below C1 customers get priority over C2 customers. If you don't care then just amend the code and combine the logic for the two to only populate and use a variable _point_c.
data have;
input CompanyID $ InvoiceAmount UserID $;
datalines;
A1 50 C1
A1 10 C7
A1 5 C3
B5 100000 C3
B5 10000 C7
B5 5000 C2
B5 1 C1
D2 100 C2
D2 50 C2
D5 1000 C3
D5 100 C4
;
/* The data is already sorted on invoiceAmount by companyID. */
proc sort data=have;
by CompanyID InvoiceAmount;
run;
data want(drop=_:);
set have;
by CompanyID InvoiceAmount;
retain _point_c1 _point_c2;
if UserID in ('C1') then _point_c1 = _n_;
else
if UserID in ('C2') then _point_c2 = _n_;
if last.CompanyID then
do;
_point=coalesce(_point_c1,_point_c2,_n_);
set have point=_point;
output;
call missing(of _point_c:);
end;
run;
proc print data=want;
run;
If each companyID has say 10+ userIDs and C1 or C2 may be present in all 10 or in none of them, would that affect the code?
The only thing that affects the code is that we need to treat C1 & C2 customers differently to the rest by giving them priority.
I meant the want should be:
data want; input CompanyID $ InvoiceAmount UserID $; datalines; A1 50 C1 B5 5000 C2 D2 100 C2 D5 1000 C3 ; run;
But I see Patrick already knew what I meant!
In my original version I had the condition that it should choose first the highest between C1 then if no C1 is present then to choose highest C2. Then if no (C1 or C2) then choose the highest invoiceAmount of any of the remainder userIDs. But for my requirement I realized I'm okay with the highest observation of C1 or C2 if either is present.
In my original version I had the condition that it should choose first the highest between C1 then if no C1 is present then to choose highest C2. Then if no (C1 or C2) then choose the highest invoiceAmount of any of the remainder userIDs. But for my requirement I realized I'm okay with the highest observation of C1 or C2 if either is present.
If you want to revert back to your original selection logic then just change the CASE logic
,case when UserID ='C1' then 1 when UserID ='C2' then 1 else 3 end
data have;
input CompanyID $ InvoiceAmount UserID $;
datalines;
A1 50 C1
A1 10 C7
A1 5 C3
B5 100000 C3
B5 10000 C7
B5 5000 C2
B5 1 C1
D2 100 C2
D2 50 C2
D5 1000 C3
D5 100 C4
;
run;
data temp;
set have;
flag=not (UserID in ('C1' 'C2'));
run;
proc sort data=temp out=temp1 ;
by CompanyID flag descending InvoiceAmount;
run;
proc sort data=temp1 out=temp2 nodupkey;
by CompanyID flag ;
run;
proc sort data=temp2 out=want nodupkey;
by CompanyID ;
run;
Sorting and judicious use of the "WHERE=" data set name parameter appears to be the simplest:
data have;
input CompanyID $ InvoiceAmount UserID $;
datalines;
A1 50 C1
A1 10 C7
A1 5 C3
B5 100000 C3
B5 10000 C7
B5 5000 C2
B5 1 C1
D2 100 C2
D2 50 C2
D5 1000 C3
D5 100 C4
;
proc sort data=have out=need;
by companyid descending invoiceamount;
run;
data want;
set need (where=(userid in ('C1','C2')) )
need (where=(not(userid in ('C1','C2'))) );
by companyid;
if first.companyid;
run;
The where= params force all the "C1" or "C2" obs to precede all the others for a given companyid. And since the data are sorted by descending invoice, the where= means the largest C1/C2 invoice will be first, even if there are larger non C1/C2 observations.
If sorting by companyid/descending invoiceamount is expensive, or you just want to avaid a proc sort, you can keep track of the max value for each id, reserving a variable for C1/C2 obs and for all obs. At the end of the id, just re-read and output the obs that has the maximum qualifying invoice.
Of course, this assumes that the data are sorted by companyid:
data want (drop=_:);
set have;
by companyid;
retain _max1 _n_max1 /*Track C!/C2 obs only*/
_max2 _n_max2 /*Track all obs */;
if first.companyid then call missing (of _:);
if userid in ('C1','C2') and invoiceamount > _max1 then do;
_max1=invoiceamount;
_n_max1=_n_;
end;
if invoiceamount > _max2 then do;
_max2=invoiceamount;
_n_max2=_n_;
end;
if last.companyid;
_n_max=coalesce(_n_max1,_n_max2);
set have point=_n_max;
run;
As your data is already sorted by CompanyID and descending InvoiceAmount, this should do it:
data want;
set
have(where=(UserID in('C1','C2')))
have(where=(UserID not in('C1','C2')))
;
by CompanyID;
if first.CompanyID;
run;
Appreciate all the suggestions!
I tested them all and they of course all worked.
I used the last one since it was the simplest for me to understand.
Thank you all 🙂
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.