Help using Base SAS procedures

The best performance for the code below

Reply
Frequent Contributor
Posts: 80

The best performance for the code below

Hi everyone, i have this code below and I would like to enhance it.

Any suggestion will help me so much. Maybe using hash would be better? If so, how to translate it to.

proc sort data=wk.seguros_carona&A4M2.  ; by OCTROI ; run;

proc sort data=pfat.refcto3m&A4M2m1.      ; by OCTROI ; run;

proc sort data=wk.pf_pfra_&A4M2.            ; by OCTROI ; run;

proc sort data=mvt_pfra                      ; by OCTROI ; run;

data wk.pf_filtros&corte._cart&A4M2.;

merge           wk.pf_&corte._cart&A4M2.          (in=a)

          wk.seguros_carona&A4M2.           (in=_seguro )

          pfat.refcto3m&A4M2m1.          (in=_rfcto3 )

          pfat.refctovd&A4M2m1.          (in=_refinvd drop=TDPRNFOY TDPRNCONT)

          wk.pf_pfra_&A4M2.                    (in=_pfra )

          mvt_pfra                              (in=_mvt );

by OCTROI ;

if a;

retain publico;

          if not _refinvd          and intck('month',d_cre,&dtbdos) ge 12

          and tvdchaine ne &submarino          then publico=0;

          else if not _refinvd                    then publico=1;

          else if _rfcto3                    then publico=3;

          else                               publico=2;

           if tdprnsoc ne (100)                                                                                then _filtro=1;

          else if (talpcposdosa in ("560" "570" "CG1" "CG2" "RET" "DCD" "CTX" "USU")

           or tdprnagcrcvt in (560 570))                                                                      then _filtro=3;

          else if cposa ne 'ENC'                                                                       then _filtro=4;

          else if verrou in (45,55,52) or verrou ge 60                                                            then _filtro=5;

          else if tdcpcficp in (5)                                                                      then _filtro=6;

          else if 0<=todu <100                                                                                                then _filtro=7;

          else if todu <7500                                                                                then _filtro=8;

          else if tvdchaine not in &submarino and (tdcpbmontssr12m gt 2 or

(tdcpbmontssr12m eq 1 and intck('month',d_dms,&dtbdos) lt 3) or

(tdcpbmontssr12m eq 2 and intck('month',d_dms,&dtbdos) lt 6) )                                         then _filtro=9;

else if tvdchaine eq &submarino and (d_dms ne . and

intck('month',d_dms,&dtbdos) lt 3)                                                                       then _filtro=9.1;

          else if modo_pagamento eq 'P'                                                             then _filtro=10;

          else if DEGRESSIVIDADE ne 997 and (DEGRESSIVIDADE<901 or DEGRESSIVIDADE>950)          then _filtro=11;

          else if tvdchaine eq 9100280                                                                      then _filtro=13;

          else if intck('month',d_pcp,&dtbdos) lt 1 and

          tvdchaine not in &submarino                                                                      then _filtro=14;

          else if _seguro                                                                                then _filtro=15;

          else if atpfra ge 1          or _mvt                                                                      then _filtro=16;

              else if tvdchaine ne 9106782 and GRENT = "Nao Abordados"                                        then _filtro=17;

          else if tvdchaine eq 9106782 and publico in (0,1) and NO_SUBA in (201)                               then _filtro=18;

          else if tvdchaine eq 9106782 and publico in (2) and NO_SUBA in (103)                               then _filtro=18;

          else if tvdchaine eq 9106782 and publico in (3) and NO_SUBA in (103,146,201)                     then _filtro=18;

          else                                                                                           _filtro=99;

run;

Thanks so much.

Trusted Advisor
Posts: 2,116

The best performance for the code below

A couple of thoughts.

1) publico does not need to be RETAINed, as it is assigned for each pass through the DATA step.

2) the second IF-THEN-ELSE  sequence might have it's performance by rearranging it to have the highest frequency values of _filtro come first.

Doc Muhlbaier

Duke

Frequent Contributor
Posts: 80

The best performance for the code below

I've these questions?

Would you recommend to make merge with many data sets for this case?

Could it be better using proc sql, considering that I would not need to sort the data sets?

Thanks for the answers.

Trusted Advisor
Posts: 2,116

The best performance for the code below

You could replace the MERGE with a PROC SQL.  The SORT is done in the backround by SQL, so it still gets done.

You would be hard pressed to do the second IF-THEN-ELSE  sequence in SQL.  You could, but it would take a bit of work.  If you don't, it's another pass through the data.

Coming up with 'best performace' has so many varaibles to consider that  we are unlikely to hit it just as an abstract process.

PROC Star
Posts: 1,167

The best performance for the code below

What you have created is a very reasonable solution to the problem. Merging data can always be troublesome.

I don't see anything in it that is inherently slow.

  How many records are in each of your input files?

  Which steps do you feel are taking too long?

A couple of suggestions, but I doubt they'll help much.

  If there are a lot of variables in your input datasets that you won't need, remove them coming into the sorts, keeping just the variables you need.

  You might see some slight improvements from your data step by reordering the if statements so that the options that occur most frequently are tested for first.

I believe that you'd only see a benefit from using hash tables if your sort is too slow. Usually SAS sorts data very quickly.

Another option is to replace the sort/merge process with a PROC SQL join, but as the underlying process will be very similar, I don't see major improvements. Maybe someone else has experience with comparing sort/merge with join.

Tom

Frequent Contributor
Posts: 80

The best performance for the code below

The data sets wk.pf_&corte._cart&A4M2 and mvt_pfra and wk.pf_pfra_&A4M2. are very large. the other are small

Trusted Advisor
Posts: 2,116

The best performance for the code below

One other performance tool that may be applicable is the use of FORMATs for subsetting.  If the small datasets just have a few variables that are used in the subsetting, then you may be able to convert some of the variables into formats for the subsetting.

Frequent Contributor
Posts: 80

The best performance for the code below

Thanks Doc@Duke and Tom. I'm considering your suggestions. What about if I create one data set with merge/join taking all variable needed and make the conditional later? Maybe using FORMAT.

Could it be a better solution?

Thanks

Augusto Souza

PROC Star
Posts: 1,760

Re: The best performance for the code below

Try to use hash tables and use option fullstimer to watch memory usage.

When you have speed issues, only discard the hash table option when you have tested and are sure you don't have enough memory. In this case, one large table and few small ones might well fit in memory.

More RAM might be a cheap and effective improvement in your case if you are not limited by the OS. Another tiny hardware change that will make a huge difference on things like sorts is to use solid state storage for the work area. Highly recommended.

Super Contributor
Posts: 349

Re: The best performance for the code below

Hi,

Is Proc Sort is taking lot of time ...you can avoid that by creating index on OCTROI variable and you avoid sort steps completely.

Thanks,

Shiva

PROC Star
Posts: 1,760

Re: The best performance for the code below

Shivas, have you ever tried to merge large unsorted tables by index? It is way slower than sorting in my experience.

If memory doesn't allow hash tables, I reckon formats is the next thing to try.

Ask a Question
Discussion stats
  • 10 replies
  • 569 views
  • 6 likes
  • 5 in conversation