BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Saurabh291989
Calcite | Level 5

I stuck in a situation in an example:

 

data xyz;
input ID:$6.  Symbol:$2. @@;
datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
;
run;
proc sql feedback;
select *, count(ID) as Total from Work.xyz group by Symbol, ID;
quit;

OUTPUT:

 

Output of above select query.Output of above select query.

Instead of the above output, I want only this one as output:

 For each group, I want only a single row when values of total column gets tied.For each group, I want only a single row when values of total column gets tied. 

Note: I need a solution for 1 million data, above is just an example. Total value can be different.

 

Someone help me out.

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

@novinosrin: I think that you're correct and I misunderstood what @Saurabh291989 was saying.

 

As such, here is a variant of @Ksharp's code that produces the same result as your code. On a dataset with 3.5 million records it ran 26% faster than your code. Probably irrelevant difference for @Saurabh291989, as your code ran in 2.45 seconds, while this code ran in 1.91 seconds. However, it also required less time to type and, for me at least, easier to follow:

 

data xyz;
  input ID:$6.  Symbol:$2. @@;
  datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC105 XY
ABC105 PQ
ABC105 YX
ABC105 AB
;
run;

data need;
  set xyz;
  position=_n_;
run;

proc sql;
  create table want (drop=position) as
    select * from
          (select * from
                  (select ID, symbol, position, count(ID) as count
                     from need
                       group by ID, Symbol
                  )
             group by id
               having count=max(count)
          )
      group by id
        having position=min(position)
  ;
quit;

The code I ran to compare the two sets of suggested code was:

data xyz (drop=i _id);
  input _ID  Symbol:$2. @@;
  do i=1 to 50000;
    id=(_id-1)*50000+i;
    output;
  end;
  datalines;
1 YX 1 XY
1 XY 1 YX
1 XY 1 YX
1 XY 1 YX
1 QP
2 PQ 2 QP
2 PQ 2 QP
2 PQ 2 QP
2 PQ 2 QP
3 IJ 3 JI
3 IJ 3 JI
3 IJ 3 JI
3 IJ 3 JI
4 UV 4 VU
4 UV 4 VU
4 UV 4 VU
4 UV 4 VU
5 XY 5 XY
5 XY 5 XY
5 PQ 5 PQ
5 PQ 5 PQ
5 YX 5 YX
5 YX 5 YX
5 AB 5 AB
5 AB 5 AB
6 PQ 6 QP
6 PQ 6 QP
6 AB 6 BA
6 AB 6 BA
6 CD 6 DC
6 CD 6 DC
6 CD 6 DC
7 AX 7 AD
7 A1 7 A0
;
run;

proc sort data=xyz;
  by id;
run;

data need;
  set xyz;
  position=_n_;
run;

proc sql;
  create table want (drop=position) as
    select * from
          (select * from
                  (select ID, symbol, position, count(ID) as count
                     from need
                       group by ID, Symbol
                  )
             group by id
               having count=max(count)
          )
      group by id
        having position=min(position)
  ;
quit;

data wantN;
  /*initilise hash object with key vars id and symbol and count as data var */
  if _n_ = 1 then do;
    declare hash h();
    h.defineKey('id','symbol');
    h.definedata('count');
    h.defineDone();
  end;
  /*first pass of the dataset to get the count of combination of id and symbol
   and load in the hash object*/
  do until (l);
    set xyz end=l;
    if h.find() ne 0 then do;
      count=1;
      h.replace();
    end;
    else do;
      count=count+1;
      h.replace();
    end;
  end; 
  l=0;
  /*second pass using doubl DOW, 1st dow to get the max of all combinations
  with id */
  do until(l);
    m=.;
    do until(last.id);
      set xyz end=l;
      by id;
      rc=h.find();
      m=max(count,m);
    end; 
    f=0;
    /*this will output the required max top value if ties or 
    just the max if no ties within an id group*/
    do until(last.id);
      set xyz ;
      by id;  
      rc=h.find();
      if m=count and  not f then do;
        output;
        f=1;
      end;
    end;
  end;
  keep id symbol count;
run;

Art, CEO, AnalystFinder.com

 

 

Art, CEO, AnalystFinder.com

 

View solution in original post

25 REPLIES 25
PaigeMiller
Diamond | Level 26

It's impossible to provide a generalized solution without more information.

 

Are the 2nd symbol values always mirror images of the first symbol value (example: UV and VU)?


Can there be more than one mirror image pair of symbols in a given ID? (For example, can ABC104 have UV and VU and BG and GB and TR and RT?)

 

Are the letters in the symbols always adjacent to one another in the alphabet? (V is adjacent to U, but can V appear with A?)

 

Does this solution HAVE to be in SQL? Because even if you say "YES", I would likely provide a non-SQL solution if I think of one.

--
Paige Miller
Saurabh291989
Calcite | Level 5
@Paige Miller
No, letters in the symbol are not adjacent. They're random in nature. The combination can be any within 26 alphabets and alphanumeric.

I would like to prefer any other non-SQL if there is no solution for SQL.
Provide me the solution, Miller.
PaigeMiller
Diamond | Level 26

@Saurabh291989 wrote:
@Paige Miller
No, letters in the symbol are not adjacent. They're random in nature. The combination can be any within 26 alphabets and alphanumeric.


So I asked you to clarify four different points regarding your data, and you clarified one of the four.

--
Paige Miller
Saurabh291989
Calcite | Level 5
@Paige Miller:
Apology for the delay:

1) No, 2nd symbol value is not the mirror images of the 1st symbol.
2) Yes, there are approx. 100-1000k mirror image pair of symbols in a given ID.
3) Nope, the letters in the symbols always not adjacent to one another in the alphabet.
4) Actually, this query has to resolve with the proc SQL but I'll take the chance with the Non-SQL solution.
novinosrin
Tourmaline | Level 20

Diagnose it. That is because you have the following distinct combinations 

 


proc sql ;
select distinct id, symbol from xyz;
*select *, count(ID) as Total from Work.xyz group by Symbol, ID;
quit;

 

ID Symbol
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
ABC101 XY
ABC101 YX
ABC102 PQ
ABC102 QP
ABC103 IJ
ABC103 JI
ABC104 UV
ABC104 VU

 

So you basically have 8 different groups as a combination of id and symbol

Saurabh291989
Calcite | Level 5
@novinosrin

I know that there are 8 different groups as a combination of id and symbol. But I want only topmost count(Total) value for each id and if count values get tied in each group then it returns only top row i.e. 1st one.
Astounding
PROC Star

I think SQL permits SUBSTR.  If so, add to the SELECT statement:

 

where substr(symbol, 1, 1) <= substr(symbol, 2, 1)

novinosrin
Tourmaline | Level 20
data xyz;
input ID:$6.  Symbol:$2. @@;
datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 VU
;
run;
/*i added one more row to test for non ties ABC104 VU */


/*if count values get tied in each group then it returns only top row i.e. 1st one,
since proc sql order is not guaranteed,creating a counter column n to the dataset "as is".*/
data temp;
set xyz;
by id;
if first.id then n=1;
else n+1;
run;

proc sql;
create table want(drop=n f) as
select *, case when std(total)=0 then  min(n) else max(total) end as f
from
(select *, count(ID) as Total from Work.temp group by Symbol, ID	
having n=min(n))
group by id
having (n=f) or (f=total);
quit;
art297
Opal | Level 21

I also added an extra record and, with my test data, @novinosrin's code didn't produce the desired result. The following requires an extra pass of the data, but with only a million records, shouldn't take very long:

data xyz;
  input ID:$6.  Symbol:$2. @@;
  datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
;
run;

data for4mat (drop=i j k);
  retain fmtname 'symbols' type 'C';
  length start $2;
  length label $2;
  array astart(36) $2 _temporary_;
  do i=65 to 90,48 to 57;
    j+1;
    astart(j)=byte(i);
  end;
  k=0;
  do i=1 to 36;
    do j=i+1 to 36;
      start=catt(astart(i),astart(j));
      label=catt(astart(i),astart(j));
      output;
      start=catt(astart(j),astart(i));
      label=catt(astart(i),astart(j));
      output;
    end;
  end;
run;

proc format cntlin = for4mat;
run;

data need;
  set xyz;
  length symbol2 $2;
  symbol2=put(symbol,$symbols.);
run;

data need (drop=symbol2 test);
  set need;
  by id symbol2 notsorted;
  length test $2;
  retain test;
  if first.symbol2 then test=symbol;
  if symbol eq test;
run;
  
proc sql;
  create table want as
    select distinct *, count(ID) as Total
      from need
        group by ID, Symbol
  ;
quit;

Art, CEO, AnalystFinder.com

 

novinosrin
Tourmaline | Level 20

Thank you @art297 and sorry to bother with a request on a weekend, would a tweak 

to 

having (n=f) or (f=total and n=min(n)); suffice?

 

EDITED after some serious test: Won't suffice if there is another combination at n=1.. Pardon me.Sorry about that

novinosrin
Tourmaline | Level 20
data xyz;
  input ID:$6.  Symbol:$2. @@;
  datalines;
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 XY ABC101 YX
ABC101 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC102 PQ ABC102 QP
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC103 IJ ABC103 JI
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
ABC104 UV ABC104 VU
;
run;

data want;
 
 if _n_ = 1 then do;
   declare hash h();
   h.defineKey('id','symbol');
   h.definedata('count');
   h.defineDone();
 end;

do until (l);
   set xyz end=l;
  if h.find() ne 0 then do;count=1;h.replace();end;
  else do;count=count+1;h.replace();end;
end; 
l=0;
do until(l);
	m=.;
	do until(last.id);
		set xyz end=l;
		by id;
		rc=h.find();
 		m=max(count,m);
	end; 
		f=0;
	do until(last.id);
		set xyz ;
		by id;	
		rc=h.find();
		if m=count and  not f then do; output;f=1;end;
	end;
end;
keep id symbol count;
run;
art297
Opal | Level 21

@novinosrin: Your code needs an adjustment as it's losing the fifth record, i.e.: ABC101 QP

 

Interestingly, on the small test file, my approach runs faster. Be interesting to see how the two approaches compare on a million record file.

 

Art, CEO, AnalystFinder.com

 

novinosrin
Tourmaline | Level 20

Good evening @art297 my understanding is that you can't have duplicate ID in your result and the code is supposed to check for max of counts of combination of groups and pick the earliest group combination within an ID that has max of the counts. Your code is picking two different counts for the same ID ABC101. I am not sure if that is what is desired.

 

Basically,if my understanding is correct , the algorithm would have to check for

1. compute the counts with combination of id, symbol group by

2. get the max of counts within a group id  and not combination

3. if no ties, easy pick that else pick the earliest combination that satisfies the max count

 

ABC101 QP has a count of 1 whilst other ABC101 has greater counts. So i think that anyway doesn't meet the max count need.

 

PS

It always makes me nervous to to be  part of thread where champs like you are part of. Sorry if any of the above doesn't make sense

 

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 25 replies
  • 2152 views
  • 4 likes
  • 6 in conversation