@novinosrin: Makes perfect sense! Either you're right, I am, or neither of us are.
Don't be nervous! You have picked up so much in the short time you've been using/learning SAS that one would be hard pressed to distinguish your posts from any of the old timers. In fact, your code has often been the best proposed.
Art, CEO, AnalystFinder.com
@novinosrin:
Yes, @novinosrin, you understood my problem correctly.
Your quotes:
" Basically,if my understanding is correct , the algorithm would have to check for
1. compute the counts with combination of id, symbol group by
2. get the max of counts within a group id and not combination
3. if no ties, easy pick that else pick the earliest combination that satisfies the max count"
That's what I'm looking for.
Just have one question: given the following example dataset, what result would you expect for ID=ABC101?
data xyz;
input ID:$6. Symbol:$2. @@;
datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
;
run;
Art, CEO, AnalystFinder.com
@Saurabh291989: In that case you still don't have a solution as none of the current offerings would provide that result.
Art, CEO, AnalystFinder.com
@Saurabh291989 Have you tested this yet? I added some comments
Morning @art297 I thought this meets OP's req. Have you noticed anywhere going off the rails?
data xyz;
input ID:$6. Symbol:$2. @@;
datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
;
run;
data want;
/*initilise hash object with key vars id and symbol and count as data var */
if _n_ = 1 then do;
declare hash h();
h.defineKey('id','symbol');
h.definedata('count');
h.defineDone();
end;
/*first pass of the dataset to get the count of combination of id and symbol
and load in the hash object*/
do until (l);
set xyz end=l;
if h.find() ne 0 then do;count=1;h.replace();end;
else do;count=count+1;h.replace();end;
end;
l=0;
/*second pass using doubl DOW, 1st dow to get the max of all combinations
with id */
do until(l);
m=.;
do until(last.id);
set xyz end=l;
by id;
rc=h.find();
m=max(count,m);
end;
f=0;
/*this will output the required max top value if ties or
just the max if no ties within an id group*/
do until(last.id);
set xyz ;
by id;
rc=h.find();
if m=count and not f then do; output;f=1;end;
end;
end;
keep id symbol count;
run;
@novinosrin: Yes I have identified a case where it doesn't work .. the first one ID=ABC101.
If I correctly understand @Saurabh291989's latest post, it should have a count of 5, where your code results in a count of 4.
Also, here is an example that uses the above dataset, but adds the new example that was included in @Saurabh291989's latest post. Your code produces a count of 1, rather than 4, for that new example:
data xyz;
input ID:$6. Symbol:$2. @@;
datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC105 XY
ABC105 PQ
ABC105 YX
ABC105 AB
;
run;
data want;
/*initilise hash object with key vars id and symbol and count as data var */
if _n_ = 1 then do;
declare hash h();
h.defineKey('id','symbol');
h.definedata('count');
h.defineDone();
end;
/*first pass of the dataset to get the count of combination of id and symbol
and load in the hash object*/
do until (l);
set xyz end=l;
if h.find() ne 0 then do;count=1;h.replace();end;
else do;count=count+1;h.replace();end;
end;
l=0;
/*second pass using doubl DOW, 1st dow to get the max of all combinations
with id */
do until(l);
m=.;
do until(last.id);
set xyz end=l;
by id;
rc=h.find();
m=max(count,m);
end;
f=0;
/*this will output the required max top value if ties or
just the max if no ties within an id group*/
do until(last.id);
set xyz ;
by id;
rc=h.find();
if m=count and not f then do; output;f=1;end;
end;
end;
keep id symbol count;
run;
which results in the following:
Art, CEO, AnalystFinder.com
My understanding was/is
1. compute the counts with combination of id, symbol group by
This means the combination count will result exactly as it did
then compare the count across the id and within ( id makes it parent group)
ABC101 has XY,YX,QP child groups XY count=4, YX=Count=4 and QP=1 the max(,XY,YX,QP) within ABC101 is 4
But the catch is there is a tie between XY and YX. However OP says pick the first record within the ID group. And in this case XY qualifies.
@novinosrin: I think that you're correct and I misunderstood what @Saurabh291989 was saying.
As such, here is a variant of @Ksharp's code that produces the same result as your code. On a dataset with 3.5 million records it ran 26% faster than your code. Probably irrelevant difference for @Saurabh291989, as your code ran in 2.45 seconds, while this code ran in 1.91 seconds. However, it also required less time to type and, for me at least, easier to follow:
data xyz;
input ID:$6. Symbol:$2. @@;
datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC105 XY
ABC105 PQ
ABC105 YX
ABC105 AB
;
run;
data need;
set xyz;
position=_n_;
run;
proc sql;
create table want (drop=position) as
select * from
(select * from
(select ID, symbol, position, count(ID) as count
from need
group by ID, Symbol
)
group by id
having count=max(count)
)
group by id
having position=min(position)
;
quit;
The code I ran to compare the two sets of suggested code was:
data xyz (drop=i _id);
input _ID Symbol:$2. @@;
do i=1 to 50000;
id=(_id-1)*50000+i;
output;
end;
datalines;
1 YX 1 XY
1 XY 1 YX
1 XY 1 YX
1 XY 1 YX
1 QP
2 PQ 2 QP
2 PQ 2 QP
2 PQ 2 QP
2 PQ 2 QP
3 IJ 3 JI
3 IJ 3 JI
3 IJ 3 JI
3 IJ 3 JI
4 UV 4 VU
4 UV 4 VU
4 UV 4 VU
4 UV 4 VU
5 XY 5 XY
5 XY 5 XY
5 PQ 5 PQ
5 PQ 5 PQ
5 YX 5 YX
5 YX 5 YX
5 AB 5 AB
5 AB 5 AB
6 PQ 6 QP
6 PQ 6 QP
6 AB 6 BA
6 AB 6 BA
6 CD 6 DC
6 CD 6 DC
6 CD 6 DC
7 AX 7 AD
7 A1 7 A0
;
run;
proc sort data=xyz;
by id;
run;
data need;
set xyz;
position=_n_;
run;
proc sql;
create table want (drop=position) as
select * from
(select * from
(select ID, symbol, position, count(ID) as count
from need
group by ID, Symbol
)
group by id
having count=max(count)
)
group by id
having position=min(position)
;
quit;
data wantN;
/*initilise hash object with key vars id and symbol and count as data var */
if _n_ = 1 then do;
declare hash h();
h.defineKey('id','symbol');
h.definedata('count');
h.defineDone();
end;
/*first pass of the dataset to get the count of combination of id and symbol
and load in the hash object*/
do until (l);
set xyz end=l;
if h.find() ne 0 then do;
count=1;
h.replace();
end;
else do;
count=count+1;
h.replace();
end;
end;
l=0;
/*second pass using doubl DOW, 1st dow to get the max of all combinations
with id */
do until(l);
m=.;
do until(last.id);
set xyz end=l;
by id;
rc=h.find();
m=max(count,m);
end;
f=0;
/*this will output the required max top value if ties or
just the max if no ties within an id group*/
do until(last.id);
set xyz ;
by id;
rc=h.find();
if m=count and not f then do;
output;
f=1;
end;
end;
end;
keep id symbol count;
run;
Art, CEO, AnalystFinder.com
Art, CEO, AnalystFinder.com
Is it that necessary to use SQL ?
data xyz;
input ID:$6. Symbol:$2. @@;
datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
;
run;
proc sql feedback number;
select * from
(select *, count(ID) as Total from Work.xyz group by Symbol, ID)
group by id
having symbol=min(symbol);
quit;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.