BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
art297
Opal | Level 21

@novinosrin: Makes perfect sense! Either you're right, I am, or neither of us are.

 

Don't be nervous! You have picked up so much in the short time you've been using/learning SAS that one would be hard pressed to distinguish your posts from any of the old timers. In fact, your code has often been the best proposed.

 

Art, CEO, AnalystFinder.com

 

Saurabh291989
Calcite | Level 5

@novinosrin:
Yes, @novinosrin, you understood my problem correctly.

 

Your quotes:

" Basically,if my understanding is correct , the algorithm would have to check for

1. compute the counts with combination of id, symbol group by

2. get the max of counts within a group id  and not combination

3. if no ties, easy pick that else pick the earliest combination that satisfies the max count"

 

That's what I'm looking for.

art297
Opal | Level 21

Just have one question: given the following example dataset, what result would you expect for ID=ABC101?

data xyz;
  input ID:$6.  Symbol:$2. @@;
  datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
;
run;

Art, CEO, AnalystFinder.com

 

Saurabh291989
Calcite | Level 5
@ art297:
Actually, I'm looking for the topmost value for the ABC101 i.e. it doesn't matter whether it is XY, YX or QP. Only matters, it should be maxed in the count and if a group count values tie, it selects only the 1st value.
example:
ID SYMBOL COUNT
ABC101 XY 4
ABC101 PQ 4
ABC101 YX 4
ABC101 AB 4

if this situation arrived only top value showed up in the result i.e. ABC101 XY 4
art297
Opal | Level 21

@Saurabh291989:  In that case you still don't have a solution as none of the current offerings would provide that result.

 

Art, CEO, AnalystFinder.com

 

 

novinosrin
Tourmaline | Level 20

@Saurabh291989  Have you tested this yet? I added some comments 

 

Morning @art297  I thought this meets OP's req. Have you noticed anywhere going off the rails?

data xyz;
  input ID:$6.  Symbol:$2. @@;
  datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
;
run;

data want;
/*initilise hash object with key vars id and symbol and count as data var */
 if _n_ = 1 then do;
   declare hash h();
   h.defineKey('id','symbol');
   h.definedata('count');
   h.defineDone();
 end;
/*first pass of the dataset to get the count of combination of id and symbol
 and load in the hash object*/
do until (l);
   set xyz end=l;
  if h.find() ne 0 then do;count=1;h.replace();end;
  else do;count=count+1;h.replace();end;
end; 
l=0;
/*second pass using doubl DOW, 1st dow to get the max of all combinations
with id */
do until(l);
	m=.;
	do until(last.id);
		set xyz end=l;
		by id;
		rc=h.find();
 		m=max(count,m);
	end; 
		f=0;
/*this will output the required max top value if ties or 
just the max if no ties within an id group*/
	do until(last.id);
		set xyz ;
		by id;	
		rc=h.find();
		if m=count and  not f then do; output;f=1;end;
	end;
end;
keep id symbol count;
run;
art297
Opal | Level 21

@novinosrin: Yes I have identified a case where it doesn't work .. the first one ID=ABC101.

 

If I correctly understand @Saurabh291989's latest post, it should have a count of 5, where your code results in a count of 4.

 

Also, here is an example that uses the above dataset, but adds the new example that was included in @Saurabh291989's latest post. Your code produces a count of 1, rather than 4, for that new example:

 

data xyz;
  input ID:$6.  Symbol:$2. @@;
  datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC105 XY
ABC105 PQ
ABC105 YX
ABC105 AB
;
run;

data want;
/*initilise hash object with key vars id and symbol and count as data var */
 if _n_ = 1 then do;
   declare hash h();
   h.defineKey('id','symbol');
   h.definedata('count');
   h.defineDone();
 end;
/*first pass of the dataset to get the count of combination of id and symbol
 and load in the hash object*/
do until (l);
   set xyz end=l;
  if h.find() ne 0 then do;count=1;h.replace();end;
  else do;count=count+1;h.replace();end;
end; 
l=0;
/*second pass using doubl DOW, 1st dow to get the max of all combinations
with id */
do until(l);
    m=.;
    do until(last.id);
        set xyz end=l;
        by id;
        rc=h.find();
        m=max(count,m);
    end; 
        f=0;
/*this will output the required max top value if ties or 
just the max if no ties within an id group*/
    do until(last.id);
        set xyz ;
        by id;  
        rc=h.find();
        if m=count and  not f then do; output;f=1;end;
    end;
end;
keep id symbol count;
run;

which results in the following:

Capture.JPG

 

Art, CEO, AnalystFinder.com

 

novinosrin
Tourmaline | Level 20

@art297 

 

My understanding was/is

1. compute the counts with combination of id, symbol group by

This means the combination count will result exactly as it did

then compare the count across  the id and within ( id makes it parent group) 

ABC101 has XY,YX,QP child groups XY count=4, YX=Count=4 and QP=1 the max(,XY,YX,QP) within ABC101 is 4

 

But the catch is there is a tie between XY and YX. However OP says pick the first record within the ID group. And in this case XY qualifies. 

art297
Opal | Level 21

@novinosrin: I think that you're correct and I misunderstood what @Saurabh291989 was saying.

 

As such, here is a variant of @Ksharp's code that produces the same result as your code. On a dataset with 3.5 million records it ran 26% faster than your code. Probably irrelevant difference for @Saurabh291989, as your code ran in 2.45 seconds, while this code ran in 1.91 seconds. However, it also required less time to type and, for me at least, easier to follow:

 

data xyz;
  input ID:$6.  Symbol:$2. @@;
  datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC105 XY
ABC105 PQ
ABC105 YX
ABC105 AB
;
run;

data need;
  set xyz;
  position=_n_;
run;

proc sql;
  create table want (drop=position) as
    select * from
          (select * from
                  (select ID, symbol, position, count(ID) as count
                     from need
                       group by ID, Symbol
                  )
             group by id
               having count=max(count)
          )
      group by id
        having position=min(position)
  ;
quit;

The code I ran to compare the two sets of suggested code was:

data xyz (drop=i _id);
  input _ID  Symbol:$2. @@;
  do i=1 to 50000;
    id=(_id-1)*50000+i;
    output;
  end;
  datalines;
1 YX 1 XY
1 XY 1 YX
1 XY 1 YX
1 XY 1 YX
1 QP
2 PQ 2 QP
2 PQ 2 QP
2 PQ 2 QP
2 PQ 2 QP
3 IJ 3 JI
3 IJ 3 JI
3 IJ 3 JI
3 IJ 3 JI
4 UV 4 VU
4 UV 4 VU
4 UV 4 VU
4 UV 4 VU
5 XY 5 XY
5 XY 5 XY
5 PQ 5 PQ
5 PQ 5 PQ
5 YX 5 YX
5 YX 5 YX
5 AB 5 AB
5 AB 5 AB
6 PQ 6 QP
6 PQ 6 QP
6 AB 6 BA
6 AB 6 BA
6 CD 6 DC
6 CD 6 DC
6 CD 6 DC
7 AX 7 AD
7 A1 7 A0
;
run;

proc sort data=xyz;
  by id;
run;

data need;
  set xyz;
  position=_n_;
run;

proc sql;
  create table want (drop=position) as
    select * from
          (select * from
                  (select ID, symbol, position, count(ID) as count
                     from need
                       group by ID, Symbol
                  )
             group by id
               having count=max(count)
          )
      group by id
        having position=min(position)
  ;
quit;

data wantN;
  /*initilise hash object with key vars id and symbol and count as data var */
  if _n_ = 1 then do;
    declare hash h();
    h.defineKey('id','symbol');
    h.definedata('count');
    h.defineDone();
  end;
  /*first pass of the dataset to get the count of combination of id and symbol
   and load in the hash object*/
  do until (l);
    set xyz end=l;
    if h.find() ne 0 then do;
      count=1;
      h.replace();
    end;
    else do;
      count=count+1;
      h.replace();
    end;
  end; 
  l=0;
  /*second pass using doubl DOW, 1st dow to get the max of all combinations
  with id */
  do until(l);
    m=.;
    do until(last.id);
      set xyz end=l;
      by id;
      rc=h.find();
      m=max(count,m);
    end; 
    f=0;
    /*this will output the required max top value if ties or 
    just the max if no ties within an id group*/
    do until(last.id);
      set xyz ;
      by id;  
      rc=h.find();
      if m=count and  not f then do;
        output;
        f=1;
      end;
    end;
  end;
  keep id symbol count;
run;

Art, CEO, AnalystFinder.com

 

 

Art, CEO, AnalystFinder.com

 

Saurabh291989
Calcite | Level 5
Apology for replying late...

@norvinosrin and @art297, Thank you, guys. Both of you codes working absolutely fine.
Ksharp
Super User

Is it that necessary to use SQL ?

 

data xyz;
input ID:$6.  Symbol:$2. @@;
datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
;
run;
proc sql feedback number;
select * from
(select *, count(ID) as Total from Work.xyz group by Symbol, ID)
group by id
having symbol=min(symbol);
quit;

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 25 replies
  • 2150 views
  • 4 likes
  • 6 in conversation