Solved: Requesting efficent look up solution for a large dataset? - Page 2

FriedEgg · Posted 02-05-2015 11:33 AM

Charlotte,

While splitting the data apart in this manner is not always the most efficient method, I am glad to hear that you found a solution to work for you. For this 4th criteria you just need a simple logic change. By taking the b array and sorting it's values you can easily target the min, max or middle value. You will either want the b[1] to get the min value, or b[2] to get the middle value when the number of missing elements in the b array is zero.

call sortn(of b

);

r_pct = put(ifn(nmiss(of b

)=0, b[2], b[1]), rank2pct.);

CharlotteCain · Posted 02-05-2015 01:37 PM

Thank you Sir, I did the change in the last piece as suggested and ran the code. However I noticed that there are quite a few missing values for r_pct as it hasn't picked for some of the id's.

Honestly, the reason I am finding the 4th criteria so hard to understand is because comprehending the combination it presents confuses me.

For example-

if dataset one has:

ID Rating1 Rating2 Rating3

10540 A B C

and in Dataset2(the look up table) if A, B,C are all found in the same row then obviously I recognise there can be only one R% to fetch. If A and B are in one row and C happens to be in another row then I would have to pick R% the lower of the two ratings because in this case A and B will have the same R%. And most importantly, if A, B and C i.e each one of these are found in 3 different rows, this is where I need to pick the R5 using the middle rating and not the lowest. Understanding this makes me mental:smileycry:. Hence the reason, I reached out for help.

/*changed code */

data x;

set one;

array a[3] rating1-rating3;

array fmt[3] $ 6 _temporary_ ('qrank.' 'rrank.' 'srank.');

array b[3] _temporary_;

do _n_=1 to 3;

if missing(a[_n_]) then continue;

b[_n_]=inputn(a[_n_], fmt[_n_]);

end;

call sortn(of b

);

*r_pct=put(min(of b

, 24), rank2pct.); /* removed this*/

r_pct = put(ifn(nmiss(of b

)=0, b[2], b[1]), rank2pct.);

run;

/*Below is the Output where I noticed missing r_pct for id's that have one or more ratings.

id	rating1	rating2	rating3	r_pct
900129
2000425
900135	Baa1	A-
2000328
900152	Baa1	NR	A
900265	Ba1	BBB-		100%
900005	Baa1	A+		20%
900108	Baa2	BBB+	A	100%
900121	A1	A	A+	50%
900008	A2	A+		50%
6964444	Baa3	BB+		100%
900149	BBB+
900124
000100-5
900153	A2	A+		50%
2000315				50%
900136	Baa1	BBB+		100%
900122	Aa3	AA-	AA-	20%
900155	A3	BBB+	BBB+	100%
1103127				100%
900185	A1	A+		50%
900151	Baa1			50%
900172	A2	A-	A	20%
900007	A-	A

FriedEgg · Posted 02-06-2015 01:55 PM

Charlotte,

I posted the sort and choose method without thinking through it. It would choose missing values frequently when it shouldn't, because the missing values would be sorted to the front. Tentatively, I will say the following will address the issue:

call sortn(of b

); *sort the ratings rank array in ascending order, nulls first;

r_pct = put(ifn(nmiss(of b

)=0, b[2], min(of b

, 23)

), rank2pct.);

If the number of missing elements in the array b is zero, then we have a ranking for each rating and we want to select the middle value. Since we sorted the array, the middle value with be present in the second index of the array b[2]

If the number of missing elements is not zero, then we want to take the min value for ranking for ratings in array b, or if all are null, we want to assign 23/UR (this was 24 previously, which equated to missing, this may still not be correct as I am working from memory)

I am going to look back over this thread later to make sure.

RW9 · Posted 01-23-2015 04:09 AM

12 years proff. Seems like a lifetime, oh, it is :smileyshocked:

jakarman · Posted 01-23-2015 04:22 AM

Sum those and you are still not there. Ow I am belonging to those oldies.

---->-- ja karman --<-----

CharlotteCain · Posted 01-23-2015 11:37 AM

Well Sir,, that's why you are smart and experienced across all SAS modules. I think I remember mentioning you as not everbody can be "JAAP of all trades" in one of his posts . Looks like he is enjoying his summer. I may be blonde but i have some memory after all. Hahaha.

Well gentlemen, thanks for everything until my next question(smiles) and bear with me until i turn from blonde to brilliant like you guys so I wouldn't have to bother so much.

jakarman · Posted 02-05-2015 03:48 PM

Charlotte your are getting nice code from Matthew with very little explanation how it works and the why of choices.
Reading q&a that would be more helpful for you, don't you think so?

---->-- ja karman --<-----

CharlotteCain · Posted 02-06-2015 11:18 AM

I agree, I love this interaction anyway, and that only inspires me more and more, and as a matter of fact I am starting to devote my weekends(sat and Sun) to learn SAS more thoroughly, so no pub(giving up my englishness lol), no fun and no boy friend hahaha but true. Nevertheless, at present it seems like one needs to first be a Matt or Xia or Jaap to even understand their answers so i need to become one to not irritate either of you guys. Wish I could express my heartfelt thanks to Matt, as my intention was never to take advantage.

I'm not smart but I promise I am nice,

Charlotte

jakarman · Posted 02-06-2015 12:18 PM

Some explanation on the informat approach. It may be not the fastest but has a lot of flexibility.
Imagine you could define several of thos and just changing that format-name. By that you could combine any conversion in one dataset.

The credit risk ratio-s are using different values you could combine all those in one format.

/* http://support.sas.com/documentation/cdl/en/proc/67327/HTML/default/viewer.htm#p1pmw90bl3jzgdn1w4202... */
proc format ;
   invalue lkp_rate
      "1", "Aaa" ,"AAA" = 0.2
      "2+", "Aa1" ,"AA+" = 0.2
      "2", "Aa2" ,"AA"   = 0.2
      "2-", "Aa3" ,"AA-" = 0.2
      "3+", "A1" ,"A+"   = 0.5
      "3", "A2" ,"A"    = 0.5
      "3-", "A3" , "A-" = 0.5
      "4+", "Baa1","BBB+" = 1.0
      "4", "Baa2","BBB" = 1.0
      "4-", "Baa3","BBB-" = 1.0
      "5+", "Ba1" ,"BB+" = 1.0
      "5", "Ba2" ,"BB"   = 1.0
      "5-", "Ba3" ,"BB-" = 1.0
      "6+", "B1" ,"B+"   = 1.5
      "6", "B2" ,"B"    = 1.5
      "6-", "B3" ,"B-"   = 1.5
      "7+", "Caa1" ,"CCC+"   = 1.5
      "7", "Caa2" ,"CCC"    = 1.5
      "7-", "Caa3" ,"CCC-" = 1.5
      "8", "9", "10", "NR" =.
      other= _ERROR_
;

A select when otherwise construct is a nice understandable one to make decisions.

You calculation on that array is having sum of - min nmiss of constructs as of dealing with not present values.

Each construct is better to verify with a small datasets as used here validating the results in a design with test/verify approach.
It helps when we know what you know and what not so we can put answers in that you can easily getting to understand.
Sometimes we are going too fast and too technical. (yep, some ego game)

---->-- ja karman --<-----

FriedEgg · Posted 02-07-2015 01:38 AM

,

Please review the following. I included some brief comments to hopefully explain along the way what is happening. This will process the multiple market_types together. Feel free to ask any followup questions you may have.

data one;

infile cards truncover;

length Market_type $ 9;

input id $ (Q_RATING R_RATING S_RATING) (:$7.);

do market_type='Retail', 'Wholesale', 'SME';

output;

end;

cards4;

900129

2000425

900135 Baa1 A-

2000328

900152 Baa1 NR A

900265 Ba1 BBB-

900005 Baa1 A+

900108 Baa2 BBB+ A

900121 A1 A A+

900008 A2 A+

6964444 Baa3 BB+

900149 BBB+

900124

000100-502

900153 A2 A+

2000315

900136 Baa1 BBB+

900122 Aa3 AA- AA-

900155 A3 BBB+ BBB+

1103127

900185 A1 A+

900151 Baa1

900172 A2 A- A

900007 A- A

99900096 Aa3 A+ A+

900216 A2 NR A+

99900169 Baa1 BBB+ A+

900016 A1 A- A+

900385 Aa3 A+ AA-

;;;;

data two;

infile cards truncover;

input Rank (Market_type Rating_LKUP Q_RATING R_RATING S_RATING R_PCT) (:$9.);

cards4;

24 Retail UR Unrated Unrated Unrated -%

23 Retail 1 Aaa AAA AAA 20%

22 Retail 2+ Aa1 AA+ AA+ 20%

21 Retail 2 Aa2 AA AA 20%

20 Retail 2- Aa3 AA- AA- 20%

19 Retail 3+ A1 A+ A+ 50%

18 Retail 3 A2 A A 50%

17 Retail 3- A3 A- A- 50%

16 Retail 4+ Baa1 BBB+ BBB+ 100%

15 Retail 4 Baa2 BBB BBB 100%

14 Retail 4- Baa3 BBB- BBB- 100%

13 Retail 5+ Ba1 BB+ BB+ 100%

12 Retail 5 Ba2 BB BB 100%

11 Retail 5- Ba3 BB- BB- 100%

10 Retail 6+ B1 B+ B+ 150%

9 Retail 6 B2 B B 150%

8 Retail 6- B3 B- B- 150%

7 Retail 7+ Caa1 CCC+ CCC+ 150%

6 Retail 7 Caa2 CCC CCC 150%

5 Retail 7- Caa3 CCC- CCC- 150%

4 Retail 8

3 Retail 9

2 Retail 10

24 SME UR Unrated Unrated Unrated -%

23 SME 1 Aaa AAA AAA 30%

22 SME 2+ Aa1 AA+ AA+ 30%

21 SME 2 Aa2 AA AA 30%

20 SME 2- Aa3 AA- AA- 30%

19 SME 3+ A1 A+ A+ 60%

18 SME 3 A2 A A 60%

17 SME 3- A3 A- A- 60%

16 SME 4+ Baa1 BBB+ BBB+ 110%

15 SME 4 Baa2 BBB BBB 110%

14 SME 4- Baa3 BBB- BBB- 110%

13 SME 5+ Ba1 BB+ BB+ 110%

12 SME 5 Ba2 BB BB 110%

11 SME 5- Ba3 BB- BB- 110%

10 SME 6+ B1 B+ B+ 160%

9 SME 6 B2 B B 160%

8 SME 6- B3 B- B- 160%

7 SME 7+ Caa1 CCC+ CCC+ 160%

6 SME 7 Caa2 CCC CCC 160%

5 SME 7- Caa3 CCC- CCC- 160%

4 SME 8

3 SME 9

2 SME 10

24 Wholesale UR Unrated Unrated Unrated -%

23 Wholesale 1 Aaa AAA AAA 10%

22 Wholesale 2+ Aa1 AA+ AA+ 10%

21 Wholesale 2 Aa2 AA AA 10%

20 Wholesale 2- Aa3 AA- AA- 10%

19 Wholesale 3+ A1 A+ A+ 40%

18 Wholesale 3 A2 A A 40%

17 Wholesale 3- A3 A- A- 40%

16 Wholesale 4+ Baa1 BBB+ BBB+ 90%

15 Wholesale 4 Baa2 BBB BBB 90%

14 Wholesale 4- Baa3 BBB- BBB- 90%

13 Wholesale 5+ Ba1 BB+ BB+ 90%

12 Wholesale 5 Ba2 BB BB 90%

11 Wholesale 5- Ba3 BB- BB- 90%

10 Wholesale 6+ B1 B+ B+ 140%

9 Wholesale 6 B2 B B 140%

8 Wholesale 6- B3 B- B- 140%

7 Wholesale 7+ Caa1 CCC+ CCC+ 140%

6 Wholesale 7 Caa2 CCC CCC 140%

5 Wholesale 7- Caa3 CCC- CCC- 140%

4 Wholesale 8

3 Wholesale 9

2 Wholesale 10

;;;;

data flip_two;

set two;

by market_type descending rank;

array r[3] Q_RATING R_RATING S_RATING;

drop i;

do i=1 to dim(r);

rating_type = vname(r);

rating = r;

output;

end;

drop Rating_LKUP Q_RATING R_RATING S_RATING;

run;

data want;

*initialize variables for hash, does not actually set;

if 0 then set one flip_two;

*define our hash lookup;

*parameters :: market_type, rating_type and rating;

*return :: rank, r_pct;

declare hash rlkup(dataset:'flip_two');

rlkup.definekey('market_type', 'rating_type', 'rating');

rlkup.definedata('rank', 'r_pct');

rlkup.definedone();

*define hash rates, used for keeping rank and r_pct;

*use ordered ascending to help find middle/min later;

*use multidata to avoid collision on equal ranks for different rating_types;

declare hash rates(ordered:'a', multidata:'y');

rates.definekey('rank');

rates.definedata('r_pct');

rates.definedone();

*use the iterator to move through rates array and find min/middle;

declare hiter iter;

*create loop to process data from ds :: one;

do until(done);

set one end=done;

*this array contains each rating to loop over and lookup rank and r_pct from hash;

array h[3] Q_RATING R_RATING S_RATING;

drop i;

do i=1 to dim(h);

*we manually define keys to get rating_type and rating from h array;

rc = rlkup.find(key: market_type, key: vname(h), key: h);

*add the rank and r_pct to rates hash;

rc = rates.add();

end;

*the number of missing ratings from input dataset :: one;

miss = cmiss(of h

);

*if all missing then get Unrated r_pct, you may need to make changes here besed on your criteria;

if miss=3 then rc = rlkup.find(key:market_type, key:'Q_RATING', key:'Unrated');

else do;

*create new iterator object to move through rates hash;

iter = _new_ hiter('rates');

*move through rates hash and find min/middle rank;

*min is first because rates hash sorted 'd';

iter.first();

*middle if miss=0 get next after first of iter;

if miss=0 then iter.next();

*to clear the rates hash we need to delete the iter;

iter.delete();

end;

output;

*clear the hash for next row;

rates.clear();

end;

keep id market_type q_rating r_rating s_rating rank r_pct;

stop;

run;

CharlotteCain · Posted 02-09-2015 11:43 AM

Good evening Sir, I hope you had a good weekend while it's been freezing cold in Liverpool, England.Thank you so much yet again for your amazingly elite style approach to the problem. Wish, you were in England :smileyblush:

I tested your code and the comments associated with it are very comfortable to follow and comprehend. It successfully picks the middle rating for 4th criteria however for other scenarios it is picking the higher rating or rank rather than the lower rating.

For example, here is the output:

Retail	900135	Baa1	A-	4	50%
Wholesale	900135	Baa1	A-	4	40%
SME	900135	Baa1	A-	4	60%

So, for ID 900135 the R_pct for the above on picking the lower rating should be 100%,90% and 110% for retail, Wholesale and SME respectively. Other than that change, It's simply very elegant.

Trust it's not as cold in PA,

Charlotte

jakarman · Posted 02-07-2015 10:31 AM

This is my code. Goal to be understandable for modifications, not optimized for performance. But is able to run on all kind of combined data.

I needed to add some update in the format. With mattews data example easy to review/debug (I hope).

proc format ;

invalue lkp_rate

"1", "Aaa" ,"AAA" = 0.2

"2+", "Aa1" ,"AA+" = 0.2

"2", "Aa2" ,"AA" = 0.2

"2-", "Aa3" ,"AA-" = 0.2

"3+", "A1" ,"A+" = 0.5

"3", "A2" ,"A" = 0.5

"3-", "A3" , "A-" = 0.5

"4+", "Baa1","BBB+" = 1.0

"4", "Baa2","BBB" = 1.0

"4-", "Baa3","BBB-" = 1.0

"5+", "Ba1" ,"BB+" = 1.0

"5", "Ba2" ,"BB" = 1.0

"5-", "Ba3" ,"BB-" = 1.0

"6+", "B1" ,"B+" = 1.5

"6", "B2" ,"B" = 1.5

"6-", "B3" ,"B-" = 1.5

"7+", "Caa1" ,"CCC+" = 1.5

"7", "Caa2" ,"CCC" = 1.5

"7-", "Caa3" ,"CCC-" = 1.5

"8", "9", "10", "NR" ," " =.

other= _ERROR_

;

invalue lkp_miss

other= . ;

run;

/* http://support.sas.com/documentation/cdl/en/lefunctionsref/67398/HTML/default/viewer.htm#n1xdirhugd3... */

%let ratings=3;

data x ( drop=_i ) ;

set one ;

array a

rating1-rating&ratings. ;

array b[&ratings.] /* _temporary_ */ ; /* all values must be set as being retained. */

Select ; /* determine the needed lookup for different types objects dates or whatever */

When ( market_type = "retail" | market_type ~= "retail" ) lookupfmt= "lkp_rate." ;

Otherwise lookupfmt= "lkp_miss." ;

end;

do _i=1 to dim(a) ;

b[_i]=inputn( a[_i], lookupfmt );

end;

/* set some handy values */

r_miss=nmiss( of b

) ;

r_valid=dim(a)-r_miss ;

/* go for the needed rpct value and way of getting a combination */

Select; /* missings on front - with 1 valid value=3 with 2,3 valid 2 */

When ( r_valid = 3 ) _med=2;

otherwise _med=3;

end;

r_pctmin=min(of b

);

Call sortn( of b

) ;

r_pctmed=b[_med] ;

Select ; /* choose your type of pecentage */

When ( market_type = "retail" | market_type ~= "retail" ) Do;

r_pct=r_pctmin ;

End;

Otherwise ;

end;

run;

---->-- ja karman --<-----

jakarman · Posted 02-09-2015 03:13 PM

I tired to make the process as understandable as possible.
May be there are some mistakes as not fully understood the requirements or having missed some loopholes.
That is why testing validating as much as possible with small sized understandable data can be that important.

I think you can adjust things and make it more complicated when needed for some reason. There is no follow up needed on this I assume.

---->-- ja karman --<-----

Re: Requesting efficent look up solution for a large dataset?

Re: Requesting efficent look up solution for a large dataset?

Re: Requesting efficent look up solution for a large dataset?

Re: Requesting efficent look up solution for a large dataset?

Re: Requesting efficent look up solution for a large dataset?

Re: Requesting efficent look up solution for a large dataset?

Re: Requesting efficent look up solution for a large dataset?

Re: Requesting efficent look up solution for a large dataset?

Re: Requesting efficent look up solution for a large dataset?

Re: Requesting efficent look up solution for a large dataset?

Re: Requesting efficent look up solution for a large dataset?

Re: Requesting efficent look up solution for a large dataset?

Re: Requesting efficent look up solution for a large dataset?

Registration is open

Registration is open

SAS Training: Just a Click Away