Requesting an efficient lookup solution for a large dataset?

Frequent Contributor
Posts: 137

Hi, I'm requesting your help with an efficient lookup solution (hashes, formats, or whatever works) for the following problem.

I have two datasets with the following variables and values:

Dataset1:

ID          Rating1  Rating2  Rating3
900129
2000425
900135      Baa1              A-
2000328
900152      Baa1     NR       A
900265      Ba1      BBB-
900005      Baa1              A+
900108      Baa2     BBB+     A
900121      A1       A        A+
900008      A2                A+
6964444     Baa3     BB+
900149                        BBB+
900124
000100-502
900153      A2                A+
2000315
900136      Baa1     BBB+
900122      Aa3      AA-      AA-
900155      A3       BBB+     BBB+
1103127
900185      A1                A+
900151      Baa1
900172      A2       A-       A
900007               A-       A
99900096    Aa3      A+       A+
900216      A2       NR       A+
99900169    Baa1     BBB+     A+
900016      A1       A-       A+
900385      Aa3      A+       AA-

Dataset2:

Rank  Rating_LKUP  Q-RATING  R-RATING  S-RATING  R%
23    1            Aaa       AAA       AAA       20%
22    2+           Aa1       AA+       AA+       20%
21    2            Aa2       AA        AA        20%
20    2-           Aa3       AA-       AA-       20%
19    3+           A1        A+        A+        50%
18    3            A2        A         A         50%
17    3-           A3        A-        A-        50%
16    4+           Baa1      BBB+      BBB+      100%
15    4            Baa2      BBB       BBB       100%
14    4-           Baa3      BBB-      BBB-      100%
13    5+           Ba1       BB+       BB+       100%
12    5            Ba2       BB        BB        100%
11    5-           Ba3       BB-       BB-       100%
10    6+           B1        B+        B+        150%
9     6            B2        B         B         150%
8     6-           B3        B-        B-        150%
7     7+           Caa1      CCC+      CCC+      150%
6     7            Caa2      CCC       CCC       150%
5     7-           Caa3      CCC-      CCC-      150%
4     8
3     9
2     10
1     UR           Unrated   Unrated   Unrated

My objective: I need a solution to look up dataset1 against dataset2 and populate the correct R% from dataset2 into dataset1 for each ID, based on the following logic:

1. For each ID there can be one, two, or three ratings in dataset1, or none at all. If there is one or more rating in dataset1, the corresponding lookup columns in dataset2 are Q-RATING, R-RATING, and S-RATING, as you can see above.

2. For example, in row 3 of dataset1 you will notice that ID 900135 has two ratings, Baa1 and A-. The corresponding lookup values, A- and Baa1, can be found in the 7th and 8th rows of dataset2 respectively.

3. So if there are 2 ratings in dataset1, there are 2 records with corresponding lookup values in dataset2; likewise, if there are 3 ratings, there are 3 records of corresponding lookup values, and at most there can be 3 in dataset2.

4. I need to pick the R% for each ID in dataset1 using the lowest rating. In other words, for ID 900135 I need to pick the R% from dataset2 using the observation with the lowest rating, which in this case is 4+ rather than 3-, since the highest-to-lowest scale runs from 1 to 10. So the R% would be 100%, comparing 50% and 100%.

5. Therefore, for the example ID 900135, I want the final result (WANT) as:

ID      Rating1  Rating2  Rating3  R%
900135  Baa1              A-       100   /*this has to be derived with 4+ being lower than 3- in the lookup rating when comparing Baa1 and A-, if that makes sense*/
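In other words, the lookup boils down to: map each rating symbol (per column) to its rank, take the minimum rank across an ID's ratings, and return that row's R%. A minimal sketch in Python, illustrative only (not SAS), using a subset of dataset2:

```python
# Each lookup column maps a rating symbol to (rank, R%); the lowest
# rating for an ID is the one with the LOWEST rank, and its R% wins.
TABLE = [  # (rank, Q-RATING, R-RATING, S-RATING, R%) -- subset of dataset2
    (20, "Aa3",  "AA-",  "AA-",  "20%"),
    (19, "A1",   "A+",   "A+",   "50%"),
    (17, "A3",   "A-",   "A-",   "50%"),
    (16, "Baa1", "BBB+", "BBB+", "100%"),
]
q_lk = {row[1]: (row[0], row[4]) for row in TABLE}  # lookup for Rating1
r_lk = {row[2]: (row[0], row[4]) for row in TABLE}  # lookup for Rating2
s_lk = {row[3]: (row[0], row[4]) for row in TABLE}  # lookup for Rating3

def r_pct(rating1=None, rating2=None, rating3=None):
    hits = [lk[v] for lk, v in ((q_lk, rating1), (r_lk, rating2), (s_lk, rating3))
            if v is not None and v in lk]
    return min(hits)[1] if hits else None  # min rank = lowest rating

print(r_pct("Baa1", None, "A-"))  # -> 100% (rank 16 beats rank 17)
```

For ID 900135 this picks rank 16 (Baa1) over rank 17 (A-) and returns the 100% bucket, matching the WANT row above.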

I'd appreciate any help I can receive and that would really mean a lot.

Many Thanks,

Charlotte




All Replies
Valued Guide
Posts: 3,208

Re: Requesting an efficient lookup solution for a large dataset?

Are you familiar with using SAS formats?

Dataset2 seems pretty small to me.

---->-- ja karman --<-----
Trusted Advisor
Posts: 1,301

Re: Requesting an efficient lookup solution for a large dataset?

data one;
infile cards truncover;
input id $ (rating1-rating3) (:$7.);
cards;
900129
2000425
900135 Baa1  A-
2000328
900152 Baa1 NR A
900265 Ba1 BBB-
900005 Baa1  A+
900108 Baa2 BBB+ A
900121 A1 A A+
900008 A2  A+
6964444 Baa3 BB+
900149   BBB+
900124
000100-502
900153 A2  A+
2000315
900136 Baa1 BBB+
900122 Aa3 AA- AA-
900155 A3 BBB+ BBB+
1103127
900185 A1  A+
900151 Baa1
900172 A2 A- A
900007  A- A
99900096 Aa3 A+ A+
900216 A2 NR A+
99900169 Baa1 BBB+ A+
900016 A1 A- A+
900385 Aa3 A+ AA-
;

data two;
infile cards truncover;
input Rank (Rating_LKUP Q_RATING R_RATING S_RATING R_PCT) ($);
cards;
23 1 Aaa AAA AAA 20%
22 2+ Aa1 AA+ AA+ 20%
21 2 Aa2 AA AA 20%
20 2- Aa3 AA- AA- 20%
19 3+ A1 A+ A+ 50%
18 3 A2 A A 50%
17 3- A3 A- A- 50%
16 4+ Baa1 BBB+ BBB+ 100%
15 4 Baa2 BBB BBB 100%
14 4- Baa3 BBB- BBB- 100%
13 5+ Ba1 BB+ BB+ 100%
12 5 Ba2 BB BB 100%
11 5- Ba3 BB- BB- 100%
10 6+ B1 B+ B+ 150%
9 6 B2 B B 150%
8 6- B3 B- B- 150%
7 7+ Caa1 CCC+ CCC+ 150%
6 7 Caa2 CCC CCC 150%
5 7- Caa3 CCC- CCC- 150%
4 8
3 9
2 10
1 UR Unrated Unrated Unrated
;

data x;
do _n_=1 by 1 until(p1done);
set two end=p1done;
by descending rank;
array _p[0:23] $7 _temporary_;
_p[_n_]=R_PCT;
array _z_[3,23] $7 _temporary_;
_z_[1,_n_]=Q_RATING;
_z_[2,_n_]=R_RATING;
_z_[3,_n_]=S_RATING;
end; /*p1done*/
array _z[3] $161 _temporary_;
do _n_=1 to 3;
_z[_n_]=peekclong(addrlong(_z_[_n_,1]), 161);
end;
do until(p2done);
set one end=p2done;
array _c[3] rating1-rating3;
array _d[3] _temporary_;
do _n_=1 to 3;
if missing(_c[_n_]) then continue;
_d[_n_]=floor(divide(findw(_z[_n_], _c[_n_], ,'st'), 7))+1;
end;
r_pct=_p[max(of _d[*], 0)];
output;
call missing(of _d[*]);
end; /*p2done*/
stop;
keep id rating1-rating3 r_pct;
run;

    Frequent Contributor
    Posts: 137

    Re: Requesting an efficient lookup solution for a large dataset?

    Sir, first off, thank you so much for your great response and for the marvelous and classy approach! Just a small clarification: I just ran your code and it produced the following result:

    R_PCT  id       rating1  rating2  rating3
           900129
           2000425
    50%    900135   Baa1              A-

    /*for id 900135, the r_pct should be 100% instead of 50%, because the 4+ rating_lkup of Baa1 is lower than the 3- rating_lkup; we need to pick the r_pct for the lowest rating, and the scale goes 1-10 with 1 being the highest rating and 10 being the lowest*/

    May I request that slight change?

    Please accept my apologies for the bother and I must say I can't thank you enough.

    Sincerely,

    Charlotte

    Solution
    ‎01-20-2015 04:38 PM
    Trusted Advisor
    Posts: 1,301

    Re: Requesting an efficient lookup solution for a large dataset?

    ,

    I made a mistake in one line of the code:

    Change:

    _d[_n_]=divide(find(_z[_n_], _c[_n_], 't'), 7);

    To:

    _d[_n_]=floor(divide(findw(_z[_n_], _c[_n_], ,'st'), 7))+1;

    Super User
    Posts: 9,854

    Re: Requesting an efficient lookup solution for a large dataset?

    I use Rank as the rating level, i.e. pick the minimum of Rank.

    data one;
    infile cards truncover;
    input id : $20. (rating1-rating3) ($);
    cards;
    900129 . . .
    2000425 . . .
    900135 Baa1 .  A-
    2000328 . . .
    900152 Baa1 NR A
    900265 Ba1 BBB-      .
    900005 Baa1 .  A+
    900108 Baa2 BBB+ A
    900121 A1 A A+
    900008 A2 .  A+
    6964444 Baa3 BB+  .
    900149 . .  BBB+
    900124 . . .
    000100-502 . . .
    900153 A2 .  A+
    2000315 . . .
    900136 Baa1 BBB+ .
    900122 Aa3 AA- AA-
    900155 A3 BBB+ BBB+
    1103127 . . .
    900185 A1 .  A+
    900151 Baa1 . .
    900172 A2 A- A
    900007  . A- A
    99900096 Aa3 A+ A+
    900216 A2 NR A+
    99900169 Baa1 BBB+ A+
    900016 A1 A- A+
    900385 Aa3 A+ AA-
    ;
     
     
    data two;
    infile cards truncover;
    input Rank (Rating_LKUP Q_RATING R_RATING S_RATING R_PCT) ($);
    cards;
    23 1 Aaa AAA AAA 20%
    22 2+ Aa1 AA+ AA+ 20%
    21 2 Aa2 AA AA 20%
    20 2- Aa3 AA- AA- 20%
    19 3+ A1 A+ A+ 50%
    18 3 A2 A A 50%
    17 3- A3 A- A- 50%
    16 4+ Baa1 BBB+ BBB+ 100%
    15 4 Baa2 BBB BBB 100%
    14 4- Baa3 BBB- BBB- 100%
    13 5+ Ba1 BB+ BB+ 100%
    12 5 Ba2 BB BB 100%
    11 5- Ba3 BB- BB- 100%
    10 6+ B1 B+ B+ 150%
    9 6 B2 B B 150%
    8 6- B3 B- B- 150%
    7 7+ Caa1 CCC+ CCC+ 150%
    6 7 Caa2 CCC CCC 150%
    5 7- Caa3 CCC- CCC- 150%
    4 8   . . . . 
    3 9   . . . . 
    2 10  . . . . 
    1 UR Unrated Unrated Unrated . 
    ;
    run;
    
    data want;
     if _n_ eq 1 then do;
      if 0 then set two;
      declare hash h1(dataset:'two');
      h1.definekey('Q_RATING');
      h1.definedata('Rank','R_PCT');
      h1.definedone();
    
      declare hash h2(dataset:'two');
      h2.definekey('R_RATING');
      h2.definedata('Rank','R_PCT');
      h2.definedone();
    
      declare hash h3(dataset:'two');
      h3.definekey('S_RATING');
      h3.definedata('Rank','R_PCT');
      h3.definedone();
     end;
    set one;
    min_rank=9999;
    if h1.find(key:rating1)=0 and not missing(rating1) then do;
       if Rank lt min_rank then do;min_rank=Rank; R=R_PCT;end;
    end;
    if h2.find(key:rating2)=0 and not missing(rating2) then do;
       if Rank lt min_rank then do;min_rank=Rank; R=R_PCT;end;
    end;
    if h3.find(key:rating3)=0 and not missing(rating3) then do;
       if Rank lt min_rank then do;min_rank=Rank; R=R_PCT;end;
    end;
    keep id rating1-rating3 R;
    run;
    
    
    

    Xia Keshan

    Trusted Advisor
    Posts: 1,301

    Re: Requesting an efficient lookup solution for a large dataset?

    Here is a different approach using formats:

    data fmts;
    set two end=done;
    by descending rank;
    array a[3] Q_RATING R_RATING S_RATING;
    do i=1 to dim(a);
    start=a[i];
    if missing(start) then continue;
    fmtname=cats(scan(vname(a[i]), 1, '_'), 'rank');
    label=put(rank, best. -l);
    type='i';
    output;
    end;
    if done then do fmtname='Qrank', 'Rrank', 'Srank';
    hlo='o';
    label='';
    output;
    end;
    hlo='';
    fmtname='rank2pct';
    start=put(rank, best. -l);
    label=r_pct;
    type='n';
    output;
    if done then do;
    hlo='o';
    label='';
    output;
    end;
    run;

    proc sort data=fmts; by fmtname start; run;

    proc format cntlin=fmts; run;

    data x;
    set one;
    array a[3] rating1-rating3;
    array fmt[3] $ 6 _temporary_ ('qrank.' 'rrank.' 'srank.');
    array b[3] _temporary_;
    do _n_=1 to 3;
    if missing(a[_n_]) then continue;
    b[_n_]=inputn(a[_n_], fmt[_n_]);
    end;
    r_pct=put(min(of b[*], 24), rank2pct.);
    run;

    Frequent Contributor
    Posts: 137

    Re: Requesting an efficient lookup solution for a large dataset?

    Hi, please don't take my not having marked the answer as correct yet as a discourtesy; I still have to test it in my real work environment at the office. As I said earlier, I can't thank you enough for the help, and I will certainly respond once I have finished testing at work and can mark it green. I am sure your and Xia's solutions will work, but I am holding off on closing the question in case I have any further clarifications.

    Above all, thank you for your invaluable time; that is something I could never pay back.

    Sincere and utmost Thanks,

    Charlotte from England

    Super User
    Super User
    Posts: 7,668

    Re: Requesting an efficient lookup solution for a large dataset?

    You can do a similar thing in SQL (not optimized, as I haven't had much time to look at this):

    data dataset1;
      id=900135; rating1="Baa1"; rating2=""; rating3="A-"; output;
    run;

    data dataset2;
      length rating q_rating r_rating s_rating $20;
      rank=17; rating="3-"; q_rating="A3"; r_rating="A-"; s_rating="A-"; r_pcent=50; output;
      rank=16; rating="4+"; q_rating="Baa1"; r_rating="BBB+"; s_rating="BBB+"; r_pcent=100; output;
    run;

    proc sql;
      create table INTER as
      select  A.*,
              (select min(RANK) from DATASET2 where A.RATING1=Q_RATING or A.RATING2=R_RATING or A.RATING3=S_RATING) as MIN_RANK
      from    DATASET1 A;
      create table WANT as
      select  A.*,
              B.R_PCENT
      from    INTER A
      left join DATASET2 B
      on      A.MIN_RANK=B.RANK;
    quit;

    Although it would be easier if the data was normalized.
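For reference, the same two-step query (compute the minimum rank per ID with a correlated subquery, then join it back to pick up R%) can be run outside SAS. Here is an illustrative sketch against SQLite via Python's sqlite3 module; table and column names follow the SQL above, with only two sample rows of dataset2:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Minimal versions of the two tables from the SQL example above.
con.executescript("""
CREATE TABLE dataset1 (id INT, rating1 TEXT, rating2 TEXT, rating3 TEXT);
INSERT INTO dataset1 VALUES (900135, 'Baa1', NULL, 'A-');

CREATE TABLE dataset2 (rank INT, rating TEXT, q_rating TEXT,
                       r_rating TEXT, s_rating TEXT, r_pcent INT);
INSERT INTO dataset2 VALUES (17, '3-', 'A3', 'A-', 'A-', 50),
                            (16, '4+', 'Baa1', 'BBB+', 'BBB+', 100);
""")

row = con.execute("""
    SELECT a.id, b.r_pcent
    FROM (SELECT d1.*,
                 (SELECT MIN(rank) FROM dataset2
                  WHERE d1.rating1 = q_rating
                     OR d1.rating2 = r_rating
                     OR d1.rating3 = s_rating) AS min_rank
          FROM dataset1 d1) a
    LEFT JOIN dataset2 b ON a.min_rank = b.rank
""").fetchone()
print(row)  # ID 900135 resolves to the 100 bucket via min rank 16
```

As in the SAS/SQL version, Baa1 matches rank 16 and A- matches rank 17, so MIN(rank) is 16 and the join returns r_pcent 100.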

    Trusted Advisor
    Posts: 1,301

    Re: Requesting an efficient lookup solution for a large dataset?

    I appreciate your well-constructed question. It was entertaining to solve. Had Xia not posted the hash solution, I would have as well. It is always good to see the different methods.

    Frequent Contributor
    Posts: 137

    Re: Requesting an efficient lookup solution for a large dataset?

    Sir/Xia/RW, "awesome", "elegant", and "eloquent" are not enough to describe your solutions (my vocabulary is limited), and they worked amazingly well. I am sure you are used to plenty of appreciation from many people, but I sincerely want to thank you for your generosity in contributing your invaluable time to help me with this problem.

    Thanks a ton and Cheers,

    Charlotte from England!

    Oops, by the way, may I ask how long you have been doing SAS programming? :)

    Trusted Advisor
    Posts: 1,301

    Re: Requesting an efficient lookup solution for a large dataset?

    Professionally, for about 8-9 years and 2-3 years as a student prior to that.

    Frequent Contributor
    Posts: 137

    Re: Requesting an efficient lookup solution for a large dataset?

    Sir, good evening from England. Sorry for the bother yet again. May I request your help in modifying your code that brilliantly uses the multi-dimensional temporary array /*my comment: that was a class act of yours*/, or even the formats approach, to work for a slightly altered requirement:

    1. Dataset1 has another variable by the name Market_type, which has 3 kinds of values, namely "Retail", "Wholesale" and "SME". This means that each ID, being a business entity, belongs to one of the 3 market types.

    2. Dataset2, being the rating lookup dataset, remains the same; however, there is one separate rating lookup table for Retail, one for Wholesale, and likewise one for SME. So in essence there are 3 different rating lookup datasets.

    3. Now the point is that the lookup needs to be done from Dataset1 <---> Retail lookup dataset exclusively for IDs that are Retail in market_type, and similarly Dataset1 <---> Wholesale lookup table for IDs that are Wholesale, and Dataset1 <---> SME lookup table. /*Sounds like a filter to be applied to dataset1, to my novice level of understanding*/

    4. The logic to get the R% remains the same, except "when there are 3 ratings for an ID, I need to pick the R% using the middle of the 3 lkup_ratings instead of the lowest as you did before; otherwise, for the remaining occurrences such as 1 or 2 ratings, the same lowest lkup rating applies as before."

    5. If there are no ratings for an ID (all blank values), the ID falls into the category of Unrated, which has to pick the default R% from the unrated observation of the respective ratings dataset. /*The weird thing here is that in some rating lookup tables the unrated observation is represented as blank, unlike the last row of Dataset2, where it is clearly stated*/

    1  UR  Unrated  Unrated  Unrated


    I fear I may not have explained this well enough to make it easy to understand. If not, please let me know and I will try to explain better. I'd sincerely appreciate your help. Many thanks as always!


    With Pleasant Regards,

    Charlotte

    Trusted Advisor
    Posts: 1,301

    Re: Requesting an efficient lookup solution for a large dataset?

    Charlotte,

    I would challenge you to take one of these examples and attempt to modify it to meet these additional criteria. The hash method presented by Xia would not be difficult to extend with additional ratings for the new Market_type class, and the same goes for the method using formats. The first method I posted would probably be the most complex to add this new functionality to, which, if that is the method you chose, may be why you are finding this difficult.

    For the hash method you could do something like the following:

    select(market_type);

    when ('retail') do;

    ...get the min rating;

    end;

    when ('wholesale') do;

    ...

    end;

    otherwise;

    end;

    You will also want to change from looking for the MIN_RANK to the MAX_RANK, if I remember your problem correctly, and set the default for MAX_RANK to 1 to get the UR result you want.

    For modifying the Format method:

    create, from the 3 data sets, 9 unique formats

    add a second dimension to the array fmt, this dimension will be the market_type

    array fmt[3,0:4] $ 32 _temporary_ ('retail' 'retailQrank.' 'retailRrank.' 'retailStype.' 'retailrank2pct.' ...);

    and then lookup with help from whichc function

    b[_n_]=inputn(a[_n_], fmt[whichc(market_type, fmt[1,0], fmt[2,0], fmt[3,0]), _n_]);

    then modify the rank2pct lookup to suit, and add logic to set UR when needed.
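Reduced to its data structure, the suggestion is one lookup per (market_type, rating column) pair, chosen at run time. A rough Python analogue, with hypothetical names and made-up sample ranks (this is not the SAS format syntax, just the shape of the idea):

```python
# Hypothetical example: each (market_type, column) pair gets its own
# rating -> rank lookup table, mirroring "one set of formats per market".
LOOKUPS = {
    ("retail", 1):    {"Baa1": 16, "A1": 19},  # rating1 lookups for Retail
    ("retail", 3):    {"A-": 17, "A+": 19},    # rating3 lookups for Retail
    ("wholesale", 1): {"Baa1": 15},            # a different table per market
}

def rank_of(market_type, col, rating):
    """Pick the lookup for this market/column; None when nothing matches."""
    return LOOKUPS.get((market_type, col), {}).get(rating)

print(rank_of("retail", 1, "Baa1"))     # uses the Retail table
print(rank_of("wholesale", 1, "Baa1"))  # same rating, Wholesale table
```

Selecting the table by market_type at lookup time is exactly what the extra array dimension plus WHICHC achieves in the format version.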

    Frequent Contributor
    Posts: 137

    Re: Requesting an efficient lookup solution for a large dataset?

    Good morning, thank you, and really sorry; I was a bit too overambitious and really fancied your first approach before reality bit me hard. :( To keep it simple, I filtered dataset1 into three different datasets based on the market_type condition, thus having 3 subsets: one for Retail, one for Wholesale and one for SME. This makes it easier for me to go back to the original formats-method code you helped me with and run it three times. No problems there, as I don't have to worry about the additional criteria 1, 2 and 3.

    But this condition dumps me though-

    4. The logic to get the R% remains same, expect "when there are 3 ratings for an ID, I need to pick R% for ID's using the  middle of the 3 lkup_ratings instead of the lowest as you did before otherwise for remaining occurances such as 1 or 2 ratings the same lowest lkup ratings applies as before."

    The middle rating makes it very difficult for me :( whereas min or max is rather clear and easy to understand.
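For what it's worth, "middle of the 3" is just the median of the looked-up ranks. The selection rule can be sketched in Python (illustrative only, not SAS):

```python
def pick_rank(ranks):
    """ranks: the looked-up rank numbers for one ID (possibly empty).

    With three ratings, use the middle (median) rank; with one or two,
    use the minimum (lowest = worst rating), as in the original logic.
    """
    if not ranks:
        return None                # unrated -> fall back to the UR row
    if len(ranks) == 3:
        return sorted(ranks)[1]    # the middle of three
    return min(ranks)              # lowest for one or two ratings

print(pick_rank([16, 19, 17]))  # three ratings -> median rank 17
print(pick_rank([16, 17]))      # two ratings   -> minimum rank 16
```

So only the final reduction over the array of ranks changes; the per-rating lookups stay exactly as before.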

    In the last part of your format code:

    do _n_=1 to 3;

    if missing(a[_n_]) then continue;

    b[_n_]=inputn(a[_n_], fmt[_n_]);

    end;

    r_pct=put(min(of b[*], 24), rank2pct.);  /*logic change or addition here? to add the condition to pick the middle rating R% when there are 3 ratings?*/

    Thanks so much,

    Charlotte
