Solved: Re: Getting a single value for each group in values ties situation usi...

Saurabh291989 · Posted 08-04-2018 09:41 AM

I stuck in a situation in an example:

data xyz;
input ID:$6.  Symbol:$2. @@;
datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
;
run;
proc sql feedback;
select *, count(ID) as Total from Work.xyz group by Symbol, ID;
quit;

OUTPUT:

Output of above select query.

Instead of the above output, I want only this one as output:

For each group, I want only a single row when values of total column gets tied.

Note: I need a solution for 1 million data, above is just an example. Total value can be different.

Someone help me out.

art297 · Posted 08-06-2018 11:29 AM

@novinosrin: I think that you're correct and I misunderstood what @Saurabh291989 was saying.

As such, here is a variant of @Ksharp's code that produces the same result as your code. On a dataset with 3.5 million records it ran 26% faster than your code. Probably irrelevant difference for @Saurabh291989, as your code ran in 2.45 seconds, while this code ran in 1.91 seconds. However, it also required less time to type and, for me at least, easier to follow:

data xyz;
  input ID:$6.  Symbol:$2. @@;
  datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC105 XY
ABC105 PQ
ABC105 YX
ABC105 AB
;
run;

data need;
  set xyz;
  position=_n_;
run;

proc sql;
  create table want (drop=position) as
    select * from
          (select * from
                  (select ID, symbol, position, count(ID) as count
                     from need
                       group by ID, Symbol
                  )
             group by id
               having count=max(count)
          )
      group by id
        having position=min(position)
  ;
quit;

The code I ran to compare the two sets of suggested code was:

data xyz (drop=i _id);
  input _ID  Symbol:$2. @@;
  do i=1 to 50000;
    id=(_id-1)*50000+i;
    output;
  end;
  datalines;
1 YX 1 XY
1 XY 1 YX
1 XY 1 YX
1 XY 1 YX
1 QP
2 PQ 2 QP
2 PQ 2 QP
2 PQ 2 QP
2 PQ 2 QP
3 IJ 3 JI
3 IJ 3 JI
3 IJ 3 JI
3 IJ 3 JI
4 UV 4 VU
4 UV 4 VU
4 UV 4 VU
4 UV 4 VU
5 XY 5 XY
5 XY 5 XY
5 PQ 5 PQ
5 PQ 5 PQ
5 YX 5 YX
5 YX 5 YX
5 AB 5 AB
5 AB 5 AB
6 PQ 6 QP
6 PQ 6 QP
6 AB 6 BA
6 AB 6 BA
6 CD 6 DC
6 CD 6 DC
6 CD 6 DC
7 AX 7 AD
7 A1 7 A0
;
run;

proc sort data=xyz;
  by id;
run;

data need;
  set xyz;
  position=_n_;
run;

proc sql;
  create table want (drop=position) as
    select * from
          (select * from
                  (select ID, symbol, position, count(ID) as count
                     from need
                       group by ID, Symbol
                  )
             group by id
               having count=max(count)
          )
      group by id
        having position=min(position)
  ;
quit;

data wantN;
  /*initilise hash object with key vars id and symbol and count as data var */
  if _n_ = 1 then do;
    declare hash h();
    h.defineKey('id','symbol');
    h.definedata('count');
    h.defineDone();
  end;
  /*first pass of the dataset to get the count of combination of id and symbol
   and load in the hash object*/
  do until (l);
    set xyz end=l;
    if h.find() ne 0 then do;
      count=1;
      h.replace();
    end;
    else do;
      count=count+1;
      h.replace();
    end;
  end; 
  l=0;
  /*second pass using doubl DOW, 1st dow to get the max of all combinations
  with id */
  do until(l);
    m=.;
    do until(last.id);
      set xyz end=l;
      by id;
      rc=h.find();
      m=max(count,m);
    end; 
    f=0;
    /*this will output the required max top value if ties or 
    just the max if no ties within an id group*/
    do until(last.id);
      set xyz ;
      by id;  
      rc=h.find();
      if m=count and  not f then do;
        output;
        f=1;
      end;
    end;
  end;
  keep id symbol count;
run;

Art, CEO, AnalystFinder.com

View solution in original post

PaigeMiller · Posted 08-04-2018 09:47 AM

It's impossible to provide a generalized solution without more information.

Are the 2nd symbol values always mirror images of the first symbol value (example: UV and VU)?

Can there be more than one mirror image pair of symbols in a given ID? (For example, can ABC104 have UV and VU and BG and GB and TR and RT?)

Are the letters in the symbols always adjacent to one another in the alphabet? (V is adjacent to U, but can V appear with A?)

Does this solution HAVE to be in SQL? Because even if you say "YES", I would likely provide a non-SQL solution if I think of one.

--
Paige Miller

Saurabh291989 · Posted 08-04-2018 10:17 AM

@Paige Miller
No, letters in the symbol are not adjacent. They're random in nature. The combination can be any within 26 alphabets and alphanumeric.

I would like to prefer any other non-SQL if there is no solution for SQL.
Provide me the solution, Miller.

PaigeMiller · Posted 08-04-2018 01:15 PM

@Saurabh291989 wrote:
@Paige Miller
No, letters in the symbol are not adjacent. They're random in nature. The combination can be any within 26 alphabets and alphanumeric.

So I asked you to clarify four different points regarding your data, and you clarified one of the four.

--
Paige Miller

Saurabh291989 · Posted 08-06-2018 07:33 AM

@Paige Miller:
Apology for the delay:

1) No, 2nd symbol value is not the mirror images of the 1st symbol.
2) Yes, there are approx. 100-1000k mirror image pair of symbols in a given ID.
3) Nope, the letters in the symbols always not adjacent to one another in the alphabet.
4) Actually, this query has to resolve with the proc SQL but I'll take the chance with the Non-SQL solution.

novinosrin · Posted 08-04-2018 09:53 AM

Diagnose it. That is because you have the following distinct combinations

proc sql ;
select distinct id, symbol from xyz;
*select *, count(ID) as Total from Work.xyz group by Symbol, ID;
quit;

ID Symbol
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
ABC101 XY
ABC101 YX
ABC102 PQ
ABC102 QP
ABC103 IJ
ABC103 JI
ABC104 UV
ABC104 VU

So you basically have 8 different groups as a combination of id and symbol

Saurabh291989 · Posted 08-04-2018 10:24 AM

@novinosrin

I know that there are 8 different groups as a combination of id and symbol. But I want only topmost count(Total) value for each id and if count values get tied in each group then it returns only top row i.e. 1st one.

Astounding · Posted 08-04-2018 10:17 AM

I think SQL permits SUBSTR. If so, add to the SELECT statement:

where substr(symbol, 1, 1) <= substr(symbol, 2, 1)

Saurabh291989 · Posted 08-06-2018 07:35 AM

@Astounding:

Sorry, buddy! it didn't work.

novinosrin · Posted 08-04-2018 12:30 PM

data xyz;
input ID:$6.  Symbol:$2. @@;
datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 VU
;
run;
/*i added one more row to test for non ties ABC104 VU */


/*if count values get tied in each group then it returns only top row i.e. 1st one,
since proc sql order is not guaranteed,creating a counter column n to the dataset "as is".*/
data temp;
set xyz;
by id;
if first.id then n=1;
else n+1;
run;

proc sql;
create table want(drop=n f) as
select *, case when std(total)=0 then  min(n) else max(total) end as f
from
(select *, count(ID) as Total from Work.temp group by Symbol, ID	
having n=min(n))
group by id
having (n=f) or (f=total);
quit;

art297 · Posted 08-04-2018 12:40 PM

I also added an extra record and, with my test data, @novinosrin's code didn't produce the desired result. The following requires an extra pass of the data, but with only a million records, shouldn't take very long:

data xyz;
  input ID:$6.  Symbol:$2. @@;
  datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
;
run;

data for4mat (drop=i j k);
  retain fmtname 'symbols' type 'C';
  length start $2;
  length label $2;
  array astart(36) $2 _temporary_;
  do i=65 to 90,48 to 57;
    j+1;
    astart(j)=byte(i);
  end;
  k=0;
  do i=1 to 36;
    do j=i+1 to 36;
      start=catt(astart(i),astart(j));
      label=catt(astart(i),astart(j));
      output;
      start=catt(astart(j),astart(i));
      label=catt(astart(i),astart(j));
      output;
    end;
  end;
run;

proc format cntlin = for4mat;
run;

data need;
  set xyz;
  length symbol2 $2;
  symbol2=put(symbol,$symbols.);
run;

data need (drop=symbol2 test);
  set need;
  by id symbol2 notsorted;
  length test $2;
  retain test;
  if first.symbol2 then test=symbol;
  if symbol eq test;
run;
  
proc sql;
  create table want as
    select distinct *, count(ID) as Total
      from need
        group by ID, Symbol
  ;
quit;

Art, CEO, AnalystFinder.com

novinosrin · Posted 08-04-2018 01:03 PM

Thank you @art297 and sorry to bother with a request on a weekend, would a tweak

to

having (n=f) or (f=total and n=min(n)); suffice?

EDITED after some serious test: Won't suffice if there is another combination at n=1.. Pardon me.Sorry about that

novinosrin · Posted 08-04-2018 07:55 PM

data xyz;
  input ID:$6.  Symbol:$2. @@;
  datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
;
run;

data want;
 
 if _n_ = 1 then do;
   declare hash h();
   h.defineKey('id','symbol');
   h.definedata('count');
   h.defineDone();
 end;

do until (l);
   set xyz end=l;
  if h.find() ne 0 then do;count=1;h.replace();end;
  else do;count=count+1;h.replace();end;
end; 
l=0;
do until(l);
	m=.;
	do until(last.id);
		set xyz end=l;
		by id;
		rc=h.find();
 		m=max(count,m);
	end; 
		f=0;
	do until(last.id);
		set xyz ;
		by id;	
		rc=h.find();
		if m=count and  not f then do; output;f=1;end;
	end;
end;
keep id symbol count;
run;

art297 · Posted 08-04-2018 08:43 PM

@novinosrin: Your code needs an adjustment as it's losing the fifth record, i.e.: ABC101 QP

Interestingly, on the small test file, my approach runs faster. Be interesting to see how the two approaches compare on a million record file.

Art, CEO, AnalystFinder.com

novinosrin · Posted 08-04-2018 09:07 PM

Good evening @art297 my understanding is that you can't have duplicate ID in your result and the code is supposed to check for max of counts of combination of groups and pick the earliest group combination within an ID that has max of the counts. Your code is picking two different counts for the same ID ABC101. I am not sure if that is what is desired.

Basically,if my understanding is correct , the algorithm would have to check for

1. compute the counts with combination of id, symbol group by

2. get the max of counts within a group id and not combination

3. if no ties, easy pick that else pick the earliest combination that satisfies the max count

ABC101 QP has a count of 1 whilst other ABC101 has greater counts. So i think that anyway doesn't meet the max count need.

PS

It always makes me nervous to to be part of thread where champs like you are part of. Sorry if any of the above doesn't make sense

Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Re: Getting a single value for each group in values ties situation using proc SQL

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!