03-24-2012 09:05 AM
Hi everyone, i have this code below and I would like to enhance it.
Any suggestion will help me so much. Maybe using hash would be better? If so, how to translate it to.
proc sort data=wk.seguros_carona&A4M2. ; by OCTROI ; run;
proc sort data=pfat.refcto3m&A4M2m1. ; by OCTROI ; run;
proc sort data=wk.pf_pfra_&A4M2. ; by OCTROI ; run;
proc sort data=mvt_pfra ; by OCTROI ; run;
merge wk.pf_&corte._cart&A4M2. (in=a)
wk.seguros_carona&A4M2. (in=_seguro )
pfat.refcto3m&A4M2m1. (in=_rfcto3 )
pfat.refctovd&A4M2m1. (in=_refinvd drop=TDPRNFOY TDPRNCONT)
wk.pf_pfra_&A4M2. (in=_pfra )
mvt_pfra (in=_mvt );
by OCTROI ;
if not _refinvd and intck('month',d_cre,&dtbdos) ge 12
and tvdchaine ne &submarino then publico=0;
else if not _refinvd then publico=1;
else if _rfcto3 then publico=3;
if tdprnsoc ne (100) then _filtro=1;
else if (talpcposdosa in ("560" "570" "CG1" "CG2" "RET" "DCD" "CTX" "USU")
or tdprnagcrcvt in (560 570)) then _filtro=3;
else if cposa ne 'ENC' then _filtro=4;
else if verrou in (45,55,52) or verrou ge 60 then _filtro=5;
else if tdcpcficp in (5) then _filtro=6;
else if 0<=todu <100 then _filtro=7;
else if todu <7500 then _filtro=8;
else if tvdchaine not in &submarino and (tdcpbmontssr12m gt 2 or
(tdcpbmontssr12m eq 1 and intck('month',d_dms,&dtbdos) lt 3) or
(tdcpbmontssr12m eq 2 and intck('month',d_dms,&dtbdos) lt 6) ) then _filtro=9;
else if tvdchaine eq &submarino and (d_dms ne . and
intck('month',d_dms,&dtbdos) lt 3) then _filtro=9.1;
else if modo_pagamento eq 'P' then _filtro=10;
else if DEGRESSIVIDADE ne 997 and (DEGRESSIVIDADE<901 or DEGRESSIVIDADE>950) then _filtro=11;
else if tvdchaine eq 9100280 then _filtro=13;
else if intck('month',d_pcp,&dtbdos) lt 1 and
tvdchaine not in &submarino then _filtro=14;
else if _seguro then _filtro=15;
else if atpfra ge 1 or _mvt then _filtro=16;
else if tvdchaine ne 9106782 and GRENT = "Nao Abordados" then _filtro=17;
else if tvdchaine eq 9106782 and publico in (0,1) and NO_SUBA in (201) then _filtro=18;
else if tvdchaine eq 9106782 and publico in (2) and NO_SUBA in (103) then _filtro=18;
else if tvdchaine eq 9106782 and publico in (3) and NO_SUBA in (103,146,201) then _filtro=18;
Thanks so much.
03-24-2012 11:51 AM
A couple of thoughts.
1) publico does not need to be RETAINed, as it is assigned for each pass through the DATA step.
2) the second IF-THEN-ELSE sequence might have it's performance by rearranging it to have the highest frequency values of _filtro come first.
03-24-2012 12:00 PM
I've these questions?
Would you recommend to make merge with many data sets for this case?
Could it be better using proc sql, considering that I would not need to sort the data sets?
Thanks for the answers.
03-24-2012 12:10 PM
You could replace the MERGE with a PROC SQL. The SORT is done in the backround by SQL, so it still gets done.
You would be hard pressed to do the second IF-THEN-ELSE sequence in SQL. You could, but it would take a bit of work. If you don't, it's another pass through the data.
Coming up with 'best performace' has so many varaibles to consider that we are unlikely to hit it just as an abstract process.
03-24-2012 11:55 AM
What you have created is a very reasonable solution to the problem. Merging data can always be troublesome.
I don't see anything in it that is inherently slow.
How many records are in each of your input files?
Which steps do you feel are taking too long?
A couple of suggestions, but I doubt they'll help much.
If there are a lot of variables in your input datasets that you won't need, remove them coming into the sorts, keeping just the variables you need.
You might see some slight improvements from your data step by reordering the if statements so that the options that occur most frequently are tested for first.
I believe that you'd only see a benefit from using hash tables if your sort is too slow. Usually SAS sorts data very quickly.
Another option is to replace the sort/merge process with a PROC SQL join, but as the underlying process will be very similar, I don't see major improvements. Maybe someone else has experience with comparing sort/merge with join.
03-24-2012 12:28 PM
One other performance tool that may be applicable is the use of FORMATs for subsetting. If the small datasets just have a few variables that are used in the subsetting, then you may be able to convert some of the variables into formats for the subsetting.
03-24-2012 12:37 PM
Thanks Doc@Duke and Tom. I'm considering your suggestions. What about if I create one data set with merge/join taking all variable needed and make the conditional later? Maybe using FORMAT.
Could it be a better solution?
05-17-2012 07:05 PM
Try to use hash tables and use option fullstimer to watch memory usage.
When you have speed issues, only discard the hash table option when you have tested and are sure you don't have enough memory. In this case, one large table and few small ones might well fit in memory.
More RAM might be a cheap and effective improvement in your case if you are not limited by the OS. Another tiny hardware change that will make a huge difference on things like sorts is to use solid state storage for the work area. Highly recommended.
05-18-2012 07:15 AM
Shivas, have you ever tried to merge large unsorted tables by index? It is way slower than sorting in my experience.
If memory doesn't allow hash tables, I reckon formats is the next thing to try.