Solved: Re: Getting a single value for each group in values ties situation usi... - Page 2

art297 · Posted 08-04-2018 09:32 PM

@novinosrin: Makes perfect sense! Either you're right, I am, or neither of us are.

Don't be nervous! You have picked up so much in the short time you've been using/learning SAS that one would be hard pressed to distinguish your posts from any of the old timers. In fact, your code has often been the best proposed.

Art, CEO, AnalystFinder.com

Saurabh291989 · Posted 08-06-2018 07:45 AM

@novinosrin:
Yes, @novinosrin, you understood my problem correctly.

Your quotes:

" Basically,if my understanding is correct , the algorithm would have to check for

1. compute the counts with combination of id, symbol group by

2. get the max of counts within a group id and not combination

3. if no ties, easy pick that else pick the earliest combination that satisfies the max count"

That's what I'm looking for.

art297 · Posted 08-06-2018 09:28 AM

Just have one question: given the following example dataset, what result would you expect for ID=ABC101?

data xyz;
  input ID:$6.  Symbol:$2. @@;
  datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
;
run;

Art, CEO, AnalystFinder.com

Saurabh291989 · Posted 08-06-2018 09:56 AM

@ art297:
Actually, I'm looking for the topmost value for the ABC101 i.e. it doesn't matter whether it is XY, YX or QP. Only matters, it should be maxed in the count and if a group count values tie, it selects only the 1st value.
example:
ID SYMBOL COUNT
ABC101 XY 4
ABC101 PQ 4
ABC101 YX 4
ABC101 AB 4

if this situation arrived only top value showed up in the result i.e. ABC101 XY 4

art297 · Posted 08-06-2018 10:25 AM

@Saurabh291989: In that case you still don't have a solution as none of the current offerings would provide that result.

Art, CEO, AnalystFinder.com

novinosrin · Posted 08-06-2018 10:38 AM

@Saurabh291989 Have you tested this yet? I added some comments

Morning @art297 I thought this meets OP's req. Have you noticed anywhere going off the rails?

data xyz;
  input ID:$6.  Symbol:$2. @@;
  datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
;
run;

data want;
/*initilise hash object with key vars id and symbol and count as data var */
 if _n_ = 1 then do;
   declare hash h();
   h.defineKey('id','symbol');
   h.definedata('count');
   h.defineDone();
 end;
/*first pass of the dataset to get the count of combination of id and symbol
 and load in the hash object*/
do until (l);
   set xyz end=l;
  if h.find() ne 0 then do;count=1;h.replace();end;
  else do;count=count+1;h.replace();end;
end; 
l=0;
/*second pass using doubl DOW, 1st dow to get the max of all combinations
with id */
do until(l);
	m=.;
	do until(last.id);
		set xyz end=l;
		by id;
		rc=h.find();
 		m=max(count,m);
	end; 
		f=0;
/*this will output the required max top value if ties or 
just the max if no ties within an id group*/
	do until(last.id);
		set xyz ;
		by id;	
		rc=h.find();
		if m=count and  not f then do; output;f=1;end;
	end;
end;
keep id symbol count;
run;

art297 · Posted 08-06-2018 10:51 AM

@novinosrin: Yes I have identified a case where it doesn't work .. the first one ID=ABC101.

If I correctly understand @Saurabh291989's latest post, it should have a count of 5, where your code results in a count of 4.

Also, here is an example that uses the above dataset, but adds the new example that was included in @Saurabh291989's latest post. Your code produces a count of 1, rather than 4, for that new example:

data xyz;
  input ID:$6.  Symbol:$2. @@;
  datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC105 XY
ABC105 PQ
ABC105 YX
ABC105 AB
;
run;

data want;
/*initilise hash object with key vars id and symbol and count as data var */
 if _n_ = 1 then do;
   declare hash h();
   h.defineKey('id','symbol');
   h.definedata('count');
   h.defineDone();
 end;
/*first pass of the dataset to get the count of combination of id and symbol
 and load in the hash object*/
do until (l);
   set xyz end=l;
  if h.find() ne 0 then do;count=1;h.replace();end;
  else do;count=count+1;h.replace();end;
end; 
l=0;
/*second pass using doubl DOW, 1st dow to get the max of all combinations
with id */
do until(l);
    m=.;
    do until(last.id);
        set xyz end=l;
        by id;
        rc=h.find();
        m=max(count,m);
    end; 
        f=0;
/*this will output the required max top value if ties or 
just the max if no ties within an id group*/
    do until(last.id);
        set xyz ;
        by id;  
        rc=h.find();
        if m=count and  not f then do; output;f=1;end;
    end;
end;
keep id symbol count;
run;

which results in the following:

Art, CEO, AnalystFinder.com

novinosrin · Posted 08-06-2018 11:03 AM

@art297

My understanding was/is

1. compute the counts with combination of id, symbol group by

This means the combination count will result exactly as it did

then compare the count across the id and within ( id makes it parent group)

ABC101 has XY,YX,QP child groups XY count=4, YX=Count=4 and QP=1 the max(,XY,YX,QP) within ABC101 is 4

But the catch is there is a tie between XY and YX. However OP says pick the first record within the ID group. And in this case XY qualifies.

art297 · Posted 08-06-2018 11:29 AM

@novinosrin: I think that you're correct and I misunderstood what @Saurabh291989 was saying.

As such, here is a variant of @Ksharp's code that produces the same result as your code. On a dataset with 3.5 million records it ran 26% faster than your code. Probably irrelevant difference for @Saurabh291989, as your code ran in 2.45 seconds, while this code ran in 1.91 seconds. However, it also required less time to type and, for me at least, easier to follow:

data xyz;
  input ID:$6.  Symbol:$2. @@;
  datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC105 XY
ABC105 PQ
ABC105 YX
ABC105 AB
;
run;

data need;
  set xyz;
  position=_n_;
run;

proc sql;
  create table want (drop=position) as
    select * from
          (select * from
                  (select ID, symbol, position, count(ID) as count
                     from need
                       group by ID, Symbol
                  )
             group by id
               having count=max(count)
          )
      group by id
        having position=min(position)
  ;
quit;

The code I ran to compare the two sets of suggested code was:

data xyz (drop=i _id);
  input _ID  Symbol:$2. @@;
  do i=1 to 50000;
    id=(_id-1)*50000+i;
    output;
  end;
  datalines;
1 YX 1 XY
1 XY 1 YX
1 XY 1 YX
1 XY 1 YX
1 QP
2 PQ 2 QP
2 PQ 2 QP
2 PQ 2 QP
2 PQ 2 QP
3 IJ 3 JI
3 IJ 3 JI
3 IJ 3 JI
3 IJ 3 JI
4 UV 4 VU
4 UV 4 VU
4 UV 4 VU
4 UV 4 VU
5 XY 5 XY
5 XY 5 XY
5 PQ 5 PQ
5 PQ 5 PQ
5 YX 5 YX
5 YX 5 YX
5 AB 5 AB
5 AB 5 AB
6 PQ 6 QP
6 PQ 6 QP
6 AB 6 BA
6 AB 6 BA
6 CD 6 DC
6 CD 6 DC
6 CD 6 DC
7 AX 7 AD
7 A1 7 A0
;
run;

proc sort data=xyz;
  by id;
run;

data need;
  set xyz;
  position=_n_;
run;

proc sql;
  create table want (drop=position) as
    select * from
          (select * from
                  (select ID, symbol, position, count(ID) as count
                     from need
                       group by ID, Symbol
                  )
             group by id
               having count=max(count)
          )
      group by id
        having position=min(position)
  ;
quit;

data wantN;
  /*initilise hash object with key vars id and symbol and count as data var */
  if _n_ = 1 then do;
    declare hash h();
    h.defineKey('id','symbol');
    h.definedata('count');
    h.defineDone();
  end;
  /*first pass of the dataset to get the count of combination of id and symbol
   and load in the hash object*/
  do until (l);
    set xyz end=l;
    if h.find() ne 0 then do;
      count=1;
      h.replace();
    end;
    else do;
      count=count+1;
      h.replace();
    end;
  end; 
  l=0;
  /*second pass using doubl DOW, 1st dow to get the max of all combinations
  with id */
  do until(l);
    m=.;
    do until(last.id);
      set xyz end=l;
      by id;
      rc=h.find();
      m=max(count,m);
    end; 
    f=0;
    /*this will output the required max top value if ties or 
    just the max if no ties within an id group*/
    do until(last.id);
      set xyz ;
      by id;  
      rc=h.find();
      if m=count and  not f then do;
        output;
        f=1;
      end;
    end;
  end;
  keep id symbol count;
run;

Art, CEO, AnalystFinder.com

Saurabh291989 · Posted 08-14-2018 09:22 AM

Apology for replying late...

@norvinosrin and @art297, Thank you, guys. Both of you codes working absolutely fine.

Ksharp · Posted 08-05-2018 07:25 AM

Is it that necessary to use SQL ?

data xyz;
input ID:$6.  Symbol:$2. @@;
datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
;
run;
proc sql feedback number;
select * from
(select *, count(ID) as Total from Work.xyz group by Symbol, ID)
group by id
having symbol=min(symbol);
quit;

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!