BookmarkSubscribeRSS Feed
RobertNYC
Obsidian | Level 7

Hi all,

 

I’ve been trying to translate several SAS MERGE passesinto HASH programing.  These MERGE passes are being run on very large datasets and take forever to complete.  I’ve been reading though many white paperss and I'm  having trouble (a lot of trouble) understanding the Hash syntax.  Any help on translating the examples below will be greatly appreciated.

 

Thanks so much! 

 

 

 

 

/*********EXAMPLE  1*******************/


proc sort data= category out= qtr_trans1(keep=mstr_customer_id td_date);
by mstr_customer_id td_date;run;
proc sort data= qtr_trans1(rename=(td_date=first_td_date)) nodupkey;
by mstr_customer_id;run;
proc sort data= category;
by mstr_customer_id;run;

data  category1;
format first_td_date mmddyy10.;
merge  category  qtr_trans1;
by mstr_customer_id;
no_of_days=td_date-first_td_date;
     if     0   <= no_of_days <=        90      then qtr=       1       ;
else if    91   <= no_of_days <=        180     then qtr=       2       ;
else if   181   <= no_of_days <=        270     then qtr=       3       ;
else if   271   <= no_of_days <=        360     then qtr=       4       ;
else if   361   <= no_of_days <=        450     then qtr=       5       ;
else if   451   <= no_of_days <=        540     then qtr=       6       ;
else if   541   <= no_of_days <=        630     then qtr=       7       ;
else if   631   <= no_of_days <=        720     then qtr=       8       ;
else if   721   <= no_of_days <=        810     then qtr=       9       ;
else if   811   <= no_of_days <=        900     then qtr=       10      ;
else if   901   <= no_of_days <=        990     then qtr=       11      ;
else if   991   <= no_of_days <=        1080 then qtr=  12      ;
else if   1081  <= no_of_days <=        1170 then qtr=  13      ;
else qtr=99;
run;


/********EXAMPLE  2**************************/

proc sort data=PULL_PROMO;by mstr_customer_id;run;
proc sort data=check;by mstr_customer_id;run;
data test1;
merge PULL_PROMO(in=a) check(in=b) ;
by mstr_customer_id;
if a and b then output;
run;


/********EXAMPLE  3**************************/

proc sort data= t_avg_time_bw_pur_catp nodupkey;by mstr_customer_id;run;
proc sort data= t_avg_time_bw_ret_catr nodupkey;by mstr_customer_id;run;
proc sort data= cat_11 nodupkey;by mstr_customer_id;run;
proc sort data= cat_12 nodupkey;by mstr_customer_id;run;
proc sort data= cat_trans_merge nodupkey;by mstr_customer_id;run;
proc sort data= t_cat_sale_pur_  nodupkey;by mstr_customer_id;run;

data email_trans;
merge
 t_avg_time_bw_pur_catp(in=b) t_avg_time_bw_ret_catr(in=c) cat_11(in=d) cat_12(in=e) 
cat_trans_merge(in=h) t_cat_sale_pur_ (in=i) ;
by mstr_customer_id;run;

 

 

5 REPLIES 5
ballardw
Super User

You will likely get better suggestions if you provide some example of the input data and then what the result should look like.

 

I'm not completely sure but I suspect that Proc SQL joins may work.

 

BTW code like this

     if     0   <= no_of_days <=        90      then qtr=       1       ;
else if    91   <= no_of_days <=        180     then qtr=       2       ;
else if   181   <= no_of_days <=        270     then qtr=       3       ;
else if   271   <= no_of_days <=        360     then qtr=       4       ;
else if   361   <= no_of_days <=        450     then qtr=       5       ;
else if   451   <= no_of_days <=        540     then qtr=       6       ;
else if   541   <= no_of_days <=        630     then qtr=       7       ;
else if   631   <= no_of_days <=        720     then qtr=       8       ;
else if   721   <= no_of_days <=        810     then qtr=       9       ;
else if   811   <= no_of_days <=        900     then qtr=       10      ;
else if   901   <= no_of_days <=        990     then qtr=       11      ;
else if   991   <= no_of_days <=        1080 then qtr=  12      ;
else if   1081  <= no_of_days <=        1170 then qtr=  13      ;
else qtr=99;

is a good candiate for custom formats or informats, especially if used frequently.

 

proc format library=work;
invalue days2qtr
   0 -   90   =  1      
  91 -   180  =  2      
 181 -   270  =  3      
 271 -   360  =  4      
 361 -   450  =  5      
 451 -   540  =  6      
 541 -   630  =  7      
 631 -   720  =  8      
 721 -   810  =  9      
 811 -   900  =  10     
 901 -   990  =  11     
 991 -   1080 =  12      
 1081-   1170 =  13      
other =99;
run;

Data junk;
   do days = 10 to 1200 by 90;
      qtr= input(put(days,best5.-L),days2qtr.);
      output;
   end;
run;

Since the input wants a character to read the put bit creates a string from the numeric days and shifts it left for consistent reading. The -L likely isn't needed in this case but when using an informat like F4.2 can be critical.

 

novinosrin
Tourmaline | Level 20

Hi @RobertNYC Like Ballard pointed out, If you could provide some input sample and an output sample of your want will certainly make it convenient to discuss the best approach fit for your problem. Looking at your datastep, use of format with cntlin and cntlout seems to be a better approach than hash in my opinion. 

 

I do not disagree with your choice of hash however I'd like to analyse different approaches, ease of coding and maintanence and so on before deciding what to implement.

 

@ballardw elegently pointed out emphasizing the application of formats when you have look up values in the form of ranges. I believe, Formats work so much more better with ranges than hashes. I'll await for more responses as I might be wrong but I severely doubt it though. Thanks!  

ChrisBrooks
Ammonite | Level 13

I agree absolutely with everything that's been said so far but I would add that while hash merging is undoubtedly faster that standard merge statements for very large files you need to be a bit careful as the data needs to be read into memory and you can encounter failures if you run out of memory while hash merging (it's occasionally happened to me).

Quentin
Super User

Good Point @ChrisBrooks.

 

@RobertNYC, if this is a big data problem, it might be worth taking the time to write some code that will simulate some data roughly the size of the tables you want to merge.  If you post that code, instead of sample data, people could run it, and then play with developing different solutions (merge vs sql vs hand written hash etc).  There are plenty of tradeoffs to consider.   In my experience merge is actually very fast, but you pay for the sort. Indexing is another possibility to consider.

BASUG is hosting free webinars Next up: Jane Eslinger presenting PROC REPORT and the ODS EXCEL destination on Mar 27 at noon ET. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
PGStats
Opal | Level 21

It might be that your first example was oversimplified but it seems to me that it doesn't require three sorts and a merge. Try this instead:

 

proc sort data=category;
by mstr_customer_id td_date;
run;

data  category1;
format first_td_date mmddyy10.;
set category;
by mstr_customer_id;
retain first_td_date;
if first.mstr_customer_id then first_td_date = td_date;
no_of_days = intck("day", first_td_date, td_date);
qtr = ceil(max(1, no_of_days) / 90);
if qtr > 13 then qtr = 99;
run;

(untested)

 

PG

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 829 views
  • 1 like
  • 6 in conversation