Hi all,
I’ve been trying to translate several SAS MERGE passesinto HASH programing. These MERGE passes are being run on very large datasets and take forever to complete. I’ve been reading though many white paperss and I'm having trouble (a lot of trouble) understanding the Hash syntax. Any help on translating the examples below will be greatly appreciated.
Thanks so much!
/*********EXAMPLE 1*******************/ proc sort data= category out= qtr_trans1(keep=mstr_customer_id td_date); by mstr_customer_id td_date;run; proc sort data= qtr_trans1(rename=(td_date=first_td_date)) nodupkey; by mstr_customer_id;run; proc sort data= category; by mstr_customer_id;run; data category1; format first_td_date mmddyy10.; merge category qtr_trans1; by mstr_customer_id; no_of_days=td_date-first_td_date; if 0 <= no_of_days <= 90 then qtr= 1 ; else if 91 <= no_of_days <= 180 then qtr= 2 ; else if 181 <= no_of_days <= 270 then qtr= 3 ; else if 271 <= no_of_days <= 360 then qtr= 4 ; else if 361 <= no_of_days <= 450 then qtr= 5 ; else if 451 <= no_of_days <= 540 then qtr= 6 ; else if 541 <= no_of_days <= 630 then qtr= 7 ; else if 631 <= no_of_days <= 720 then qtr= 8 ; else if 721 <= no_of_days <= 810 then qtr= 9 ; else if 811 <= no_of_days <= 900 then qtr= 10 ; else if 901 <= no_of_days <= 990 then qtr= 11 ; else if 991 <= no_of_days <= 1080 then qtr= 12 ; else if 1081 <= no_of_days <= 1170 then qtr= 13 ; else qtr=99; run; /********EXAMPLE 2**************************/ proc sort data=PULL_PROMO;by mstr_customer_id;run; proc sort data=check;by mstr_customer_id;run; data test1; merge PULL_PROMO(in=a) check(in=b) ; by mstr_customer_id; if a and b then output; run; /********EXAMPLE 3**************************/ proc sort data= t_avg_time_bw_pur_catp nodupkey;by mstr_customer_id;run; proc sort data= t_avg_time_bw_ret_catr nodupkey;by mstr_customer_id;run; proc sort data= cat_11 nodupkey;by mstr_customer_id;run; proc sort data= cat_12 nodupkey;by mstr_customer_id;run; proc sort data= cat_trans_merge nodupkey;by mstr_customer_id;run; proc sort data= t_cat_sale_pur_ nodupkey;by mstr_customer_id;run; data email_trans; merge t_avg_time_bw_pur_catp(in=b) t_avg_time_bw_ret_catr(in=c) cat_11(in=d) cat_12(in=e) cat_trans_merge(in=h) t_cat_sale_pur_ (in=i) ; by mstr_customer_id;run;
You will likely get better suggestions if you provide some example of the input data and then what the result should look like.
I'm not completely sure but I suspect that Proc SQL joins may work.
BTW code like this
if 0 <= no_of_days <= 90 then qtr= 1 ; else if 91 <= no_of_days <= 180 then qtr= 2 ; else if 181 <= no_of_days <= 270 then qtr= 3 ; else if 271 <= no_of_days <= 360 then qtr= 4 ; else if 361 <= no_of_days <= 450 then qtr= 5 ; else if 451 <= no_of_days <= 540 then qtr= 6 ; else if 541 <= no_of_days <= 630 then qtr= 7 ; else if 631 <= no_of_days <= 720 then qtr= 8 ; else if 721 <= no_of_days <= 810 then qtr= 9 ; else if 811 <= no_of_days <= 900 then qtr= 10 ; else if 901 <= no_of_days <= 990 then qtr= 11 ; else if 991 <= no_of_days <= 1080 then qtr= 12 ; else if 1081 <= no_of_days <= 1170 then qtr= 13 ; else qtr=99;
is a good candiate for custom formats or informats, especially if used frequently.
proc format library=work; invalue days2qtr 0 - 90 = 1 91 - 180 = 2 181 - 270 = 3 271 - 360 = 4 361 - 450 = 5 451 - 540 = 6 541 - 630 = 7 631 - 720 = 8 721 - 810 = 9 811 - 900 = 10 901 - 990 = 11 991 - 1080 = 12 1081- 1170 = 13 other =99; run; Data junk; do days = 10 to 1200 by 90; qtr= input(put(days,best5.-L),days2qtr.); output; end; run;
Since the input wants a character to read the put bit creates a string from the numeric days and shifts it left for consistent reading. The -L likely isn't needed in this case but when using an informat like F4.2 can be critical.
Hi @RobertNYC Like Ballard pointed out, If you could provide some input sample and an output sample of your want will certainly make it convenient to discuss the best approach fit for your problem. Looking at your datastep, use of format with cntlin and cntlout seems to be a better approach than hash in my opinion.
I do not disagree with your choice of hash however I'd like to analyse different approaches, ease of coding and maintanence and so on before deciding what to implement.
@ballardw elegently pointed out emphasizing the application of formats when you have look up values in the form of ranges. I believe, Formats work so much more better with ranges than hashes. I'll await for more responses as I might be wrong but I severely doubt it though. Thanks!
I agree absolutely with everything that's been said so far but I would add that while hash merging is undoubtedly faster that standard merge statements for very large files you need to be a bit careful as the data needs to be read into memory and you can encounter failures if you run out of memory while hash merging (it's occasionally happened to me).
Good Point @ChrisBrooks.
@RobertNYC, if this is a big data problem, it might be worth taking the time to write some code that will simulate some data roughly the size of the tables you want to merge. If you post that code, instead of sample data, people could run it, and then play with developing different solutions (merge vs sql vs hand written hash etc). There are plenty of tradeoffs to consider. In my experience merge is actually very fast, but you pay for the sort. Indexing is another possibility to consider.
It might be that your first example was oversimplified but it seems to me that it doesn't require three sorts and a merge. Try this instead:
proc sort data=category;
by mstr_customer_id td_date;
run;
data category1;
format first_td_date mmddyy10.;
set category;
by mstr_customer_id;
retain first_td_date;
if first.mstr_customer_id then first_td_date = td_date;
no_of_days = intck("day", first_td_date, td_date);
qtr = ceil(max(1, no_of_days) / 90);
if qtr > 13 then qtr = 99;
run;
(untested)
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.