Re: Slow performance macro

ven · Posted 05-23-2015 04:02 AM

I am having a big performance issue with the macro I 've written, Extremely slow (no actual output) when runs with large input data sets., Appreciate any help in re writing the code.

My code is as below.

%macro getHistAllPlanteam(PPID,s_start_date);

data pt_all_planteam_temp (keep=ParentPartyID All_Planners g_snapshot_start_date g_snapshot_end_date );
set &_INPUT1 (where=(Contact_planner ne "n/a" and ParentPartyID="&PPID" and g_snapshot_start_date="&s_start_date"dt));
by ParentPartyID Plan_mgmt_team_member;

length All_Planners $255;

if first.ParentPartyID then do;
retain All_Planners;
All_Planners=cat(trim(Plan_mgmt_team_member)," (",trim(Participant_Management_Role_Desc),")");
end;
else do All_Planners=catx("; ",trim(All_Planners),cat(trim(Plan_mgmt_team_member)," (",trim(Participant_Management_Role_Desc),")"));
end;
if last.ParentPartyID then output;

run;

proc append base=&_OUTPUT1 data=pt_all_planteam_temp force;
run;

%mend getHistAllPlanteam;

proc sql;
delete from &_OUTPUT1;
run;

Data _Null_;
set &_INPUT2;
st_dtmval=put(g_snapshot_start_date,DATETIME.);
call execute('%getHistAllPlanteam('||ParentPartyID||','||st_dtmval||')');
run;

Below is a sample input;

&_INPUT2
ParentPartyID g_snapshot_start_date g_snapshot_end_date
1-104O5G 26FEB15:00:00:00 01JAN99:00:00:00

&_INPUT1
ParentPartyID      g_snapshot_start_date      g_snapshot_end_date                          Plan_mgmt_team_member Participant_Management_Role_Desc
1-104O5G            26FEB15:00:00:00            1JAN99:00:00:00                                    David Addison                  Primary Planner
1-104O5G            6FEB15:00:00:00              01JAN99:00:00:00                                  Rachael Mayne                    Plan Support Coordinator

Expected output:

ParentPartyID All_plan_mgmt_team g_snapshot_start_date g_snapshot_end_date

1-104O5G David Addison (Primary Planner); Rachael Mayne (Plan Support Coordinator) 26FEB15:00:00:00 01JAN99:00:00:00

jakarman · Posted 05-23-2015 06:49 AM

There is no slow performance of a macro. Your design for the data-processing is one only acceptable for very small datasets (Mb's).
Problems:

1/ you are generating separate datasteps/appends using the call execute

2/ There is a delete from * with proc sql, This could be the only one working with an external DBMS (no DDL allowed only DML).

3/ not seeing any thoughts for big data tuning.

Going back to the basics.

- what is your environment (OS SAS )

- limitations (sizing dasd - memory)

- Is the data in an external DBMS or only local

- What is your logical problem/issue.

Bulk/batch processing van be tuned for that purpose as there isn online/transactional processing. The latter one always is having some performance penalties.

---->-- ja karman --<-----

ven · Posted 05-23-2015 10:28 AM

HI Jaap,

I'm using the code in SAS DI custom made transformation.

I am using Windows 7

I am having about 4GB of memory but my dataset set is having only

&_INPUT1=16,000 rows and &_INPUT2=13196 rows.

Data is only as SAS Datasets, no RDBS involved.

Logically,what I need to achieve is for each parentpartyID concatenate the values of Participant_Management_Role_Desc across Plan_mgmt_team_member and put into another distinct variable in a SAS DI Transformation.

I have initially discuss first issue of this problem in: https://communities.sas.com/thread/75832

Thank You!

jakarman · Posted 05-23-2015 11:05 AM

Small..... You decided to write own code (macro) not DI transforms. Wating for the hash approach of Xia ( he is experienced-fast with that) and executing will be fasssst

---->-- ja karman --<-----

Ksharp · Posted 05-23-2015 11:06 PM

Thanks your compliment .Jaap Karman

Since OP didn't give enough data , It is hard to give some advice.

ven · Posted 05-24-2015 09:00 AM

Hi Xia,

Thanks for your concern. I have attached 2 sample input data file and sas code file which will create output1 as I required.

This is not working when input files are large. Quite new to hash tables appreciate your help.

Cheers

Ksharp · Posted 05-23-2015 07:00 AM

It seems that you are doing some QUERY thing . Hash Table would be good for that.

jakarman · Posted 05-23-2015 07:46 AM

Could be Xia could be... Let us wait the op's coming with all information.
- May be a hash is good when it is not Iimited by memory constraints,
- May be some sql or indatabase processing when data is in an external dbms.
- May be some advanced buffer tuning or some other smart analytics solution
- May be...

---->-- ja karman --<-----

Ksharp · Posted 05-24-2015 10:14 AM

OK. Hope you have enough memory to handle this.

proc import datafile='C:\temp\Input1.csv' out=input1 dbms=csv replace;
delimiter='|';
run;
proc import datafile='C:\temp\Input2.csv' out=input2 dbms=csv replace;
delimiter='|';
run;

data temp;
 if _n_ eq 1 then do;
  if 0 then set input2;
  declare hash h(dataset:'input2',hashexp:20);
  h.definekey('ParentPartyID','g_snapshot_start_date');
  h.definedone();
 end;
set input1;
if h.check()=0 and Contact_planner ne "n/a" ;
run;
proc sort data=temp;by ParentPartyID g_snapshot_start_date;run;
data want;
 set temp;
 by ParentPartyID g_snapshot_start_date;
 length All_Planners $ 255;
 retain All_Planners;
 All_Planners=catx(" ",All_Planners,Plan_mgmt_team_member,cats("(",Participant_Management_Role_Desc,")"));
 if last.g_snapshot_start_date then do;output; call missing(All_Planners);end;
run;

Xia Keshan

ven · Posted 05-25-2015 07:12 PM

Thanks for that, yet in my code my By variables are ParentPartyID and Plan_mgmt_team_member , I will try to add it and check it out;

Tom · Posted 05-25-2015 12:42 PM

So it does not look like you need a macro at all. You just want to merge the two dataset and collapse the rows.

data INPUT1 ;

infile cards dsd dlm='|' ;

length ParentPartyId $10

g_snapshot_start_date g_snapshot_end_date 8

Plan_mgmt_team_member

Participant_Management_Role_Desc

Contact_planner

Contact_RSO_or_RAO

Contact_LAC $50

;

informat g_snapshot_start_date g_snapshot_end_date datetime.;

format g_snapshot_start_date g_snapshot_end_date datetime.;

input ParentPartyId -- Contact_LAC;

cards;

1-100O7B|16MAR15:00:00:00|01JAN99:00:00:00|Christine Godden|Primary Planner|Christine Godden|n/a|n/a

1-100O7B|13MAR15:00:00:00|15MAR15:23:59:59|David Addison|Primary Planner|David Addison|n/a|n/a

1-100O7B|13MAR15:00:00:00|15MAR15:23:59:59|Rachael Mayne|Plan Support Coordinator|Rachael Mayne|n/a|n/a

1-101HPT|16DEC14:00:00:00|01JAN99:00:00:00|Annalise Korsch|Primary Planner|Annalise Korsch|n/a|n/a

run;

data INPUT2 ;

infile cards dsd dlm='|' ;

length ParentPartyId $10

g_snapshot_start_date g_snapshot_end_date 8

;

informat g_snapshot_start_date g_snapshot_end_date datetime.;

format g_snapshot_start_date g_snapshot_end_date datetime.;

input ParentPartyId -- g_snapshot_end_date;

cards;

1-100O7B|13MAR15:00:00:00|15MAR15:23:59:59

1-100O7B|16MAR15:00:00:00|01JAN99:00:00:00

1-101HPT|16DEC14:00:00:00|01JAN99:00:00:00

run;

proc sort data=input1;

by ParentPartyID g_snapshot_start_date g_snapshot_end_date;

run;

proc sort data=input2;

by ParentPartyID g_snapshot_start_date g_snapshot_end_date;

run;

data pt_all_planteam_temp ;

merge INPUT1 (in=in1) INPUT2 (in=in2);

by ParentPartyID g_snapshot_start_date g_snapshot_end_date;

if in1 and in2;

length All_Planners $255;

retain All_Planners;

if first.g_snapshot_end_date then

All_Planners=cat(trim(Plan_mgmt_team_member)," (",trim(Participant_Management_Role_Desc),")")

;

else

All_Planners=catx("; ",trim(All_Planners),cat(trim(Plan_mgmt_team_member)," (",trim(Participant_Management_Role_Desc),")"))

;

if last.g_snapshot_end_date;

keep ParentPartyID g_snapshot_start_date g_snapshot_end_date All_Planners ;

run;

ven · Posted 05-25-2015 07:15 PM

Thanks for the reply, Yet by variable Plan_mgmt_team_member excluded and it has to be in the code. Will check will this approach by including the missing by variable.

Tom · Posted 05-26-2015 10:12 AM

Why do you think that variable was excluded?

Did you mean that you want to create more output records based on changes to that variable also? if so then you cannot do the merge and the roll up in one step (without more complicated code) since it is not on the second table you posted.

ven · Posted 05-26-2015 09:06 PM

Hi Tom,

Firstly with Set By A, B variables last.A not necessarily means last.B as well. I have added input.txt file (BY ParentPartyID g_snapshot_start_date) with those variable values. Actually this was my initial problem as well and prevent me from merging in the first instance.

I have added the expected result in the second file called Expected_output.txt where my expected output is the dataset OUTPUT1 in TestCode.sas.

Hope this explains the issue more clearly.

Thanks!

Tom · Posted 05-26-2015 10:41 PM

Did you even run the code I posted? It produces the results you requested.

Here is a link that explains how FIRST. and LAST. variables work.

http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a000761931.htm

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away