I stuck in a situation in an example:
data xyz;
input ID:$6. Symbol:$2. @@;
datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
;
run;
proc sql feedback;
select *, count(ID) as Total from Work.xyz group by Symbol, ID;
quit;
OUTPUT:
Instead of the above output, I want only this one as output:
Note: I need a solution for 1 million data, above is just an example. Total value can be different.
Someone help me out.
@novinosrin: I think that you're correct and I misunderstood what @Saurabh291989 was saying.
As such, here is a variant of @Ksharp's code that produces the same result as your code. On a dataset with 3.5 million records it ran 26% faster than your code. Probably irrelevant difference for @Saurabh291989, as your code ran in 2.45 seconds, while this code ran in 1.91 seconds. However, it also required less time to type and, for me at least, easier to follow:
data xyz;
input ID:$6. Symbol:$2. @@;
datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC105 XY
ABC105 PQ
ABC105 YX
ABC105 AB
;
run;
data need;
set xyz;
position=_n_;
run;
proc sql;
create table want (drop=position) as
select * from
(select * from
(select ID, symbol, position, count(ID) as count
from need
group by ID, Symbol
)
group by id
having count=max(count)
)
group by id
having position=min(position)
;
quit;
The code I ran to compare the two sets of suggested code was:
data xyz (drop=i _id);
input _ID Symbol:$2. @@;
do i=1 to 50000;
id=(_id-1)*50000+i;
output;
end;
datalines;
1 YX 1 XY
1 XY 1 YX
1 XY 1 YX
1 XY 1 YX
1 QP
2 PQ 2 QP
2 PQ 2 QP
2 PQ 2 QP
2 PQ 2 QP
3 IJ 3 JI
3 IJ 3 JI
3 IJ 3 JI
3 IJ 3 JI
4 UV 4 VU
4 UV 4 VU
4 UV 4 VU
4 UV 4 VU
5 XY 5 XY
5 XY 5 XY
5 PQ 5 PQ
5 PQ 5 PQ
5 YX 5 YX
5 YX 5 YX
5 AB 5 AB
5 AB 5 AB
6 PQ 6 QP
6 PQ 6 QP
6 AB 6 BA
6 AB 6 BA
6 CD 6 DC
6 CD 6 DC
6 CD 6 DC
7 AX 7 AD
7 A1 7 A0
;
run;
proc sort data=xyz;
by id;
run;
data need;
set xyz;
position=_n_;
run;
proc sql;
create table want (drop=position) as
select * from
(select * from
(select ID, symbol, position, count(ID) as count
from need
group by ID, Symbol
)
group by id
having count=max(count)
)
group by id
having position=min(position)
;
quit;
data wantN;
/*initilise hash object with key vars id and symbol and count as data var */
if _n_ = 1 then do;
declare hash h();
h.defineKey('id','symbol');
h.definedata('count');
h.defineDone();
end;
/*first pass of the dataset to get the count of combination of id and symbol
and load in the hash object*/
do until (l);
set xyz end=l;
if h.find() ne 0 then do;
count=1;
h.replace();
end;
else do;
count=count+1;
h.replace();
end;
end;
l=0;
/*second pass using doubl DOW, 1st dow to get the max of all combinations
with id */
do until(l);
m=.;
do until(last.id);
set xyz end=l;
by id;
rc=h.find();
m=max(count,m);
end;
f=0;
/*this will output the required max top value if ties or
just the max if no ties within an id group*/
do until(last.id);
set xyz ;
by id;
rc=h.find();
if m=count and not f then do;
output;
f=1;
end;
end;
end;
keep id symbol count;
run;
Art, CEO, AnalystFinder.com
Art, CEO, AnalystFinder.com
It's impossible to provide a generalized solution without more information.
Are the 2nd symbol values always mirror images of the first symbol value (example: UV and VU)?
Can there be more than one mirror image pair of symbols in a given ID? (For example, can ABC104 have UV and VU and BG and GB and TR and RT?)
Are the letters in the symbols always adjacent to one another in the alphabet? (V is adjacent to U, but can V appear with A?)
Does this solution HAVE to be in SQL? Because even if you say "YES", I would likely provide a non-SQL solution if I think of one.
@Saurabh291989 wrote:
@Paige Miller
No, letters in the symbol are not adjacent. They're random in nature. The combination can be any within 26 alphabets and alphanumeric.
So I asked you to clarify four different points regarding your data, and you clarified one of the four.
Diagnose it. That is because you have the following distinct combinations
proc sql ;
select distinct id, symbol from xyz;
*select *, count(ID) as Total from Work.xyz group by Symbol, ID;
quit;
ID Symbol
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
ABC101 XY
ABC101 YX
ABC102 PQ
ABC102 QP
ABC103 IJ
ABC103 JI
ABC104 UV
ABC104 VU
So you basically have 8 different groups as a combination of id and symbol
I think SQL permits SUBSTR. If so, add to the SELECT statement:
where substr(symbol, 1, 1) <= substr(symbol, 2, 1)
data xyz;
input ID:$6. Symbol:$2. @@;
datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 VU
;
run;
/*i added one more row to test for non ties ABC104 VU */
/*if count values get tied in each group then it returns only top row i.e. 1st one,
since proc sql order is not guaranteed,creating a counter column n to the dataset "as is".*/
data temp;
set xyz;
by id;
if first.id then n=1;
else n+1;
run;
proc sql;
create table want(drop=n f) as
select *, case when std(total)=0 then min(n) else max(total) end as f
from
(select *, count(ID) as Total from Work.temp group by Symbol, ID
having n=min(n))
group by id
having (n=f) or (f=total);
quit;
I also added an extra record and, with my test data, @novinosrin's code didn't produce the desired result. The following requires an extra pass of the data, but with only a million records, shouldn't take very long:
data xyz;
input ID:$6. Symbol:$2. @@;
datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
;
run;
data for4mat (drop=i j k);
retain fmtname 'symbols' type 'C';
length start $2;
length label $2;
array astart(36) $2 _temporary_;
do i=65 to 90,48 to 57;
j+1;
astart(j)=byte(i);
end;
k=0;
do i=1 to 36;
do j=i+1 to 36;
start=catt(astart(i),astart(j));
label=catt(astart(i),astart(j));
output;
start=catt(astart(j),astart(i));
label=catt(astart(i),astart(j));
output;
end;
end;
run;
proc format cntlin = for4mat;
run;
data need;
set xyz;
length symbol2 $2;
symbol2=put(symbol,$symbols.);
run;
data need (drop=symbol2 test);
set need;
by id symbol2 notsorted;
length test $2;
retain test;
if first.symbol2 then test=symbol;
if symbol eq test;
run;
proc sql;
create table want as
select distinct *, count(ID) as Total
from need
group by ID, Symbol
;
quit;
Art, CEO, AnalystFinder.com
Thank you @art297 and sorry to bother with a request on a weekend, would a tweak
to
having (n=f) or (f=total and n=min(n)); suffice?
EDITED after some serious test: Won't suffice if there is another combination at n=1.. Pardon me.Sorry about that
data xyz;
input ID:$6. Symbol:$2. @@;
datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
;
run;
data want;
if _n_ = 1 then do;
declare hash h();
h.defineKey('id','symbol');
h.definedata('count');
h.defineDone();
end;
do until (l);
set xyz end=l;
if h.find() ne 0 then do;count=1;h.replace();end;
else do;count=count+1;h.replace();end;
end;
l=0;
do until(l);
m=.;
do until(last.id);
set xyz end=l;
by id;
rc=h.find();
m=max(count,m);
end;
f=0;
do until(last.id);
set xyz ;
by id;
rc=h.find();
if m=count and not f then do; output;f=1;end;
end;
end;
keep id symbol count;
run;
@novinosrin: Your code needs an adjustment as it's losing the fifth record, i.e.: ABC101 QP
Interestingly, on the small test file, my approach runs faster. Be interesting to see how the two approaches compare on a million record file.
Art, CEO, AnalystFinder.com
Good evening @art297 my understanding is that you can't have duplicate ID in your result and the code is supposed to check for max of counts of combination of groups and pick the earliest group combination within an ID that has max of the counts. Your code is picking two different counts for the same ID ABC101. I am not sure if that is what is desired.
Basically,if my understanding is correct , the algorithm would have to check for
1. compute the counts with combination of id, symbol group by
2. get the max of counts within a group id and not combination
3. if no ties, easy pick that else pick the earliest combination that satisfies the max count
ABC101 QP has a count of 1 whilst other ABC101 has greater counts. So i think that anyway doesn't meet the max count need.
PS
It always makes me nervous to to be part of thread where champs like you are part of. Sorry if any of the above doesn't make sense
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.