Hey guys, I currently have problems with my macro in terms of performance issues.
I have the following 2 Tables:
data a;
infile datalines dlm='/';
input (Date cusip fundid value) ($);
datalines;
Mar2012/ 123 / A / 5$
Mar2012/ 123/ B / 6$
Apr2012 / 124 / B / 6$
May2012/ 125/ C/ 2$
Jun2012/ 123/ D/ 4$
;
data b;
infile datalines dlm='/';
input (Date cusip value mod) ($);
datalines;
Mar2012/ 123 / 4$ / 6$
May2012/ 125/ 3$ / 7$;
I just simply want to make a inner join of the both datasets.
Additionally, we have the following characteristics:
Table a is a lookup-table which means that every observation of table b exists at least once in table a. But table a
can have observations which table b does not have.
Furthermore, there is a one-to-many-relationship: Table a can have many representations with the same cusip/date combination and table b only have unique observations.
So, my wanted dataset looks like this (classValue is the column mod of table b, but renamed). I just want to output all date/cusip observations of table a that match to table b and additionally add the value mod of table b (and rename it as classValue).
data want;
infile datalines dlm='/';
input (date cusip fundid value classValue) ($);
datalines;
Mar2012/ 123 / A / 5$ / 6$
Mar2012/ 123/ B / 6$ / 6$
May2012/ 125/ C/ 2$ / 7$
;
My current solution looks like this (both datasets are already sorted by date and cusip)
data want;
merge a (in=a) b(in=b keep=date cusip_hld mod rename=(mod=classValue));
by date cusip;
if a and b;
run;
Unfortunately, the performance is quite bad.
I guess, since table a is a lookup-table, a hash solution could probably improve the performance a lot.
Could someone kindly provide me a hash solution for my problem?
Thanks in advance 🙂
data a;
infile datalines dlm='/';
input (Date cusip fundid value) ($);
datalines;
Mar2012/ 123 / A / 5$
Mar2012/ 123/ B / 6$
Apr2012 / 124 / B / 6$
May2012/ 125/ C/ 2$
Jun2012/ 123/ D/ 4$
;
data b;
infile datalines dlm='/';
input (Date cusip value mod) ($);
datalines;
Mar2012/ 123 / 4$ / 6$
May2012/ 125/ 3$ / 7$
;
data want;
if _n_=1 then do;
if 0 then set b(drop=value);
dcl hash H (dataset:'b(drop=value)',multidata:'y') ;
h.definekey ("Date","cusip") ;
h.definedata (all:'y') ;
h.definedone () ;
end;
set a;
if h.find()=0;
run;
Data sets are not clear to enable easy copy paste. Can you paste verified your data values as plain text
data a;
infile datalines dlm='/';
input (Date cusip fundid value) ($);
datalines;
Mar2012/ 123 / A / 5$
Mar2012/ 123/ B / 6$
Apr2012 / 124 / B / 6$
May2012/ 125/ C/ 2$
Jun2012/ 123/ D/ 4$
;
data b;
infile datalines dlm='/';
input (Date cusip value mod) ($);
datalines;
Mar2012/ 123 / 4$ / 6$
May2012/ 125/ 3$ / 7$
;
data want;
if _n_=1 then do;
if 0 then set b(drop=value);
dcl hash H (dataset:'b(drop=value)',multidata:'y') ;
h.definekey ("Date","cusip") ;
h.definedata (all:'y') ;
h.definedone () ;
end;
set a;
if h.find()=0;
run;
Sry mate for the bad representation of my data. Thank you sooooooooooooo much for your quick help.
I´m so excited about how much the performance can be improved with that.
You always help me, how can I support you for that?
For completeness:
data a;
input (Date cusip fundid value) ($);
datalines;
Mar2012 123 A 5$
Mar2012 123 B 6$
Apr2012 124 B 6$
May2012 125 C 2$
Jun2012 123 D 4$
;
data b;
input (Date cusip value mod) ($);
datalines;
Mar2012 123 4$ 6$
May2012 125 3$ 7$
;
data want;
input (Date cusip fundid value classValue) ($);
datalines;
Mar2012 123 A 5$ 6$
Mar2012 123 B 6$ 6$
May2012 125 C 2$ 7$
;
We are here to share knowledge, so no need to thank or be formal. I always enjoyed your enthuse. The same code works for your recent data above
data want;
if _n_=1 then do;
if 0 then set b(drop=value);
dcl hash H (dataset:'b(drop=value)',multidata:'y') ;
h.definekey ("Date","cusip") ;
h.definedata (all:'y') ;
h.definedone () ;
end;
set a;
if h.find()=0;
run;
Also @mrzlatan91 Someday we shall have paulaner dunkel dark beer at a nice pub. 🙂 Have fun.
@novinosrinYeah, this would be nice.
Tell me when you´re in Germany 🙂
You may also need to consider memory size when using Hash tables. If your table is very large then you might encounter insufficient memory for hash table processing, you might need to subset the dataset in this case.
@SuryaKiran thanks for your advice. Is there a way to find out how much memory is used and available?
options fullstimer; --> This will put some stats in your log. and to find your system memory available use proc options option=memory;run; or in a macro %let memsize=%sysfunc(getoption(MEMSIZE));
Thank you,
I have tested it now. Unfortunately, the Hash solution is slower than the merge solution.
I´m testing it on a working station and dont have any memory problems (max. used memory is about 20000.00k, OS memory is about 400000.00k). I dont really understand why the hash solution is slower.
Is there another possibility to improve the performance (e.g. using more memory) ?
I have already loaded one of the datasets in the memory (via sasfile).
Can indexed merge help?
data a;
infile datalines dlm='/';
input (Date cusip fundid value) ($);
datalines;
Mar2012/ 123 / A / 5$
Mar2012/ 123/ B / 6$
Apr2012 / 124 / B / 6$
May2012/ 125/ C/ 2$
Jun2012/ 123/ D/ 4$
;
data b;
infile datalines dlm='/';
input (Date cusip value mod) ($);
datalines;
Mar2012/ 123 / 4$ / 6$
May2012/ 125/ 3$ / 7$
;
proc datasets library=work nolist;
modify b;
index create Date cusip / unique;
modify a;
index create keyname=(Date cusip);
quit;
data want;
merge a (in=a) b(in=b keep=date cusip mod rename=(mod=classValue));
by Date cusip;
if a and b;
run;
@novinosrinsry for the late answer.
I get an error message that double values for date / cusip are not valid in table b. But Its not possible that I have double values in it.
I have also tested 2 different SQL statements with select date, cusip vS. select distinct date, cusip and get the same amount of observations.
@mrzlatan91 if the thread has been answered, you can mark the question as answered
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.