Solved: Re: Performance Issues with merging (Hash Solution needed)

mrzlatan91 · Posted 09-10-2018 03:28 PM

Hey guys, I currently have problems with my macro in terms of performance issues.

I have the following 2 Tables:

data a;
infile datalines dlm='/';
input (Date cusip fundid value) ($);
datalines;

Mar2012/ 123 / A / 5$
Mar2012/ 123/ B / 6$
Apr2012 / 124 / B / 6$
May2012/ 125/ C/ 2$
Jun2012/ 123/ D/ 4$

;

data b;
infile datalines dlm='/';
input (Date cusip value mod) ($);
datalines;

Mar2012/ 123 / 4$ / 6$
May2012/ 125/ 3$ / 7$;

I just simply want to make a inner join of the both datasets.

Additionally, we have the following characteristics:

Table a is a lookup-table which means that every observation of table b exists at least once in table a. But table a

can have observations which table b does not have.

Furthermore, there is a one-to-many-relationship: Table a can have many representations with the same cusip/date combination and table b only have unique observations.

So, my wanted dataset looks like this (classValue is the column mod of table b, but renamed). I just want to output all date/cusip observations of table a that match to table b and additionally add the value mod of table b (and rename it as classValue).

data want;
infile datalines dlm='/';
input (date cusip fundid value classValue) ($);
datalines;
Mar2012/ 123 / A / 5$ / 6$
Mar2012/ 123/ B / 6$ / 6$
May2012/ 125/ C/ 2$ / 7$

;

My current solution looks like this (both datasets are already sorted by date and cusip)

data want;
merge a (in=a) b(in=b keep=date cusip_hld mod rename=(mod=classValue));
by date cusip;
if a and b;
run;

Unfortunately, the performance is quite bad.

I guess, since table a is a lookup-table, a hash solution could probably improve the performance a lot.

Could someone kindly provide me a hash solution for my problem?

Thanks in advance 🙂

novinosrin · Posted 09-10-2018 03:39 PM

data a;
infile datalines dlm='/';
input (Date cusip fundid value) ($);
datalines;
Mar2012/ 123 / A / 5$
Mar2012/ 123/ B / 6$
Apr2012 / 124 / B / 6$
May2012/ 125/ C/ 2$
Jun2012/ 123/ D/ 4$
;

 

data b;
infile datalines dlm='/';
input (Date cusip value mod) ($);
datalines;
Mar2012/ 123 / 4$ / 6$
May2012/ 125/ 3$ / 7$
;

data want;
if _n_=1 then do;
if 0 then set b(drop=value);
 dcl hash H (dataset:'b(drop=value)',multidata:'y') ;
   h.definekey  ("Date","cusip") ;
   h.definedata (all:'y') ;
   h.definedone () ;
end;
set a;
if h.find()=0;
run;

View solution in original post

novinosrin · Posted 09-10-2018 03:33 PM

Data sets are not clear to enable easy copy paste. Can you paste verified your data values as plain text

novinosrin · Posted 09-10-2018 03:39 PM

data a;
infile datalines dlm='/';
input (Date cusip fundid value) ($);
datalines;
Mar2012/ 123 / A / 5$
Mar2012/ 123/ B / 6$
Apr2012 / 124 / B / 6$
May2012/ 125/ C/ 2$
Jun2012/ 123/ D/ 4$
;

 

data b;
infile datalines dlm='/';
input (Date cusip value mod) ($);
datalines;
Mar2012/ 123 / 4$ / 6$
May2012/ 125/ 3$ / 7$
;

data want;
if _n_=1 then do;
if 0 then set b(drop=value);
 dcl hash H (dataset:'b(drop=value)',multidata:'y') ;
   h.definekey  ("Date","cusip") ;
   h.definedata (all:'y') ;
   h.definedone () ;
end;
set a;
if h.find()=0;
run;

mrzlatan91 · Posted 09-10-2018 03:50 PM

Sry mate for the bad representation of my data. Thank you sooooooooooooo much for your quick help.

I´m so excited about how much the performance can be improved with that.

You always help me, how can I support you for that?

For completeness:

data a;
input (Date cusip fundid value) ($);
datalines;
Mar2012 123 A 5$
Mar2012 123 B 6$
Apr2012 124 B 6$
May2012 125 C 2$
Jun2012 123 D 4$
;

 

data b;
input (Date cusip value mod) ($);
datalines;
Mar2012 123 4$ 6$
May2012 125 3$ 7$
;

data want;
input (Date cusip fundid value classValue) ($);
datalines;
Mar2012 123 A 5$ 6$
Mar2012 123 B 6$ 6$
May2012 125 C 2$ 7$
;

novinosrin · Posted 09-10-2018 04:11 PM

We are here to share knowledge, so no need to thank or be formal. I always enjoyed your enthuse. The same code works for your recent data above

data want;
if _n_=1 then do;
if 0 then set b(drop=value);
dcl hash H (dataset:'b(drop=value)',multidata:'y') ;
h.definekey ("Date","cusip") ;
h.definedata (all:'y') ;
h.definedone () ;
end;
set a;
if h.find()=0;
run;

Also @mrzlatan91 Someday we shall have paulaner dunkel dark beer at a nice pub. 🙂 Have fun.

mrzlatan91 · Posted 09-10-2018 04:41 PM

@novinosrinYeah, this would be nice.

Tell me when you´re in Germany 🙂

SuryaKiran · Posted 09-10-2018 03:51 PM

You may also need to consider memory size when using Hash tables. If your table is very large then you might encounter insufficient memory for hash table processing, you might need to subset the dataset in this case.

Thanks,
Suryakiran

mrzlatan91 · Posted 09-10-2018 04:44 PM

@SuryaKiran thanks for your advice. Is there a way to find out how much memory is used and available?

SuryaKiran · Posted 09-11-2018 09:22 AM

options fullstimer; --> This will put some stats in your log. and to find your system memory available use proc options option=memory;run; or in a macro %let memsize=%sysfunc(getoption(MEMSIZE));

Thanks,
Suryakiran

mrzlatan91 · Posted 09-11-2018 04:41 PM

Thank you,

I have tested it now. Unfortunately, the Hash solution is slower than the merge solution.

I´m testing it on a working station and dont have any memory problems (max. used memory is about 20000.00k, OS memory is about 400000.00k). I dont really understand why the hash solution is slower.

Is there another possibility to improve the performance (e.g. using more memory) ?

I have already loaded one of the datasets in the memory (via sasfile).

novinosrin · Posted 09-11-2018 05:09 PM

Can indexed merge help?

data a;
infile datalines dlm='/';
input (Date cusip fundid value) ($);
datalines;
Mar2012/ 123 / A / 5$
Mar2012/ 123/ B / 6$
Apr2012 / 124 / B / 6$
May2012/ 125/ C/ 2$
Jun2012/ 123/ D/ 4$
;

 

data b;
infile datalines dlm='/';
input (Date cusip value mod) ($);
datalines;
Mar2012/ 123 / 4$ / 6$
May2012/ 125/ 3$ / 7$
;


proc datasets library=work nolist;
 modify b;
 index create Date cusip / unique;
 modify a;
 index create keyname=(Date cusip);
quit;

data want;
 merge  a (in=a) b(in=b keep=date cusip mod rename=(mod=classValue));
 by Date cusip;
 if a and b;
 run;

mrzlatan91 · Posted 09-13-2018 03:47 PM

@novinosrinsry for the late answer.

I get an error message that double values for date / cusip are not valid in table b. But Its not possible that I have double values in it.

I have also tested 2 different SQL statements with select date, cusip vS. select distinct date, cusip and get the same amount of observations.

Andygray · Posted 09-10-2018 04:35 PM

@mrzlatan91 if the thread has been answered, you can mark the question as answered

Classroom Training Available!