I am having a big performance issue with the macro I 've written, Extremely slow (no actual output) when runs with large input data sets., Appreciate any help in re writing the code.
My code is as below.
%macro getHistAllPlanteam(PPID,s_start_date);
data pt_all_planteam_temp (keep=ParentPartyID All_Planners g_snapshot_start_date g_snapshot_end_date );
set &_INPUT1 (where=(Contact_planner ne "n/a" and ParentPartyID="&PPID" and g_snapshot_start_date="&s_start_date"dt));
by ParentPartyID Plan_mgmt_team_member;
length All_Planners $255;
if first.ParentPartyID then do;
retain All_Planners;
All_Planners=cat(trim(Plan_mgmt_team_member)," (",trim(Participant_Management_Role_Desc),")");
end;
else do All_Planners=catx("; ",trim(All_Planners),cat(trim(Plan_mgmt_team_member)," (",trim(Participant_Management_Role_Desc),")"));
end;
if last.ParentPartyID then output;
run;
proc append base=&_OUTPUT1 data=pt_all_planteam_temp force;
run;
%mend getHistAllPlanteam;
proc sql;
delete from &_OUTPUT1;
run;
Data _Null_;
set &_INPUT2;
st_dtmval=put(g_snapshot_start_date,DATETIME.);
call execute('%getHistAllPlanteam('||ParentPartyID||','||st_dtmval||')');
run;
Below is a sample input;
&_INPUT2
ParentPartyID g_snapshot_start_date g_snapshot_end_date
1-104O5G 26FEB15:00:00:00 01JAN99:00:00:00
&_INPUT1
ParentPartyID g_snapshot_start_date g_snapshot_end_date Plan_mgmt_team_member Participant_Management_Role_Desc
1-104O5G 26FEB15:00:00:00 1JAN99:00:00:00 David Addison Primary Planner
1-104O5G 6FEB15:00:00:00 01JAN99:00:00:00 Rachael Mayne Plan Support Coordinator
Expected output:
ParentPartyID All_plan_mgmt_team g_snapshot_start_date g_snapshot_end_date
1-104O5G David Addison (Primary Planner); Rachael Mayne (Plan Support Coordinator) 26FEB15:00:00:00 01JAN99:00:00:00
There is no slow performance of a macro. Your design for the data-processing is one only acceptable for very small datasets (Mb's).
Problems:
1/ you are generating separate datasteps/appends using the call execute
2/ There is a delete from * with proc sql, This could be the only one working with an external DBMS (no DDL allowed only DML).
3/ not seeing any thoughts for big data tuning.
Going back to the basics.
- what is your environment (OS SAS )
- limitations (sizing dasd - memory)
- Is the data in an external DBMS or only local
- What is your logical problem/issue.
Bulk/batch processing van be tuned for that purpose as there isn online/transactional processing. The latter one always is having some performance penalties.
HI Jaap,
I'm using the code in SAS DI custom made transformation.
I am using Windows 7
I am having about 4GB of memory but my dataset set is having only
&_INPUT1=16,000 rows and &_INPUT2=13196 rows.
Data is only as SAS Datasets, no RDBS involved.
Logically,what I need to achieve is for each parentpartyID concatenate the values of Participant_Management_Role_Desc across Plan_mgmt_team_member and put into another distinct variable in a SAS DI Transformation.
I have initially discuss first issue of this problem in: https://communities.sas.com/thread/75832
Thank You!
Small..... You decided to write own code (macro) not DI transforms. Wating for the hash approach of Xia ( he is experienced-fast with that) and executing will be fasssst
Thanks your compliment .Jaap Karman
Since OP didn't give enough data , It is hard to give some advice.
Hi Xia,
Thanks for your concern. I have attached 2 sample input data file and sas code file which will create output1 as I required.
This is not working when input files are large. Quite new to hash tables appreciate your help.
Cheers
It seems that you are doing some QUERY thing . Hash Table would be good for that.
Could be Xia could be... Let us wait the op's coming with all information.
- May be a hash is good when it is not Iimited by memory constraints,
- May be some sql or indatabase processing when data is in an external dbms.
- May be some advanced buffer tuning or some other smart analytics solution
- May be...
OK. Hope you have enough memory to handle this.
proc import datafile='C:\temp\Input1.csv' out=input1 dbms=csv replace; delimiter='|'; run; proc import datafile='C:\temp\Input2.csv' out=input2 dbms=csv replace; delimiter='|'; run; data temp; if _n_ eq 1 then do; if 0 then set input2; declare hash h(dataset:'input2',hashexp:20); h.definekey('ParentPartyID','g_snapshot_start_date'); h.definedone(); end; set input1; if h.check()=0 and Contact_planner ne "n/a" ; run; proc sort data=temp;by ParentPartyID g_snapshot_start_date;run; data want; set temp; by ParentPartyID g_snapshot_start_date; length All_Planners $ 255; retain All_Planners; All_Planners=catx(" ",All_Planners,Plan_mgmt_team_member,cats("(",Participant_Management_Role_Desc,")")); if last.g_snapshot_start_date then do;output; call missing(All_Planners);end; run;
Xia Keshan
Thanks for that, yet in my code my By variables are ParentPartyID and Plan_mgmt_team_member , I will try to add it and check it out;
So it does not look like you need a macro at all. You just want to merge the two dataset and collapse the rows.
data INPUT1 ;
infile cards dsd dlm='|' ;
length ParentPartyId $10
g_snapshot_start_date g_snapshot_end_date 8
Plan_mgmt_team_member
Participant_Management_Role_Desc
Contact_planner
Contact_RSO_or_RAO
Contact_LAC $50
;
informat g_snapshot_start_date g_snapshot_end_date datetime.;
format g_snapshot_start_date g_snapshot_end_date datetime.;
input ParentPartyId -- Contact_LAC;
cards;
1-100O7B|16MAR15:00:00:00|01JAN99:00:00:00|Christine Godden|Primary Planner|Christine Godden|n/a|n/a
1-100O7B|13MAR15:00:00:00|15MAR15:23:59:59|David Addison|Primary Planner|David Addison|n/a|n/a
1-100O7B|13MAR15:00:00:00|15MAR15:23:59:59|Rachael Mayne|Plan Support Coordinator|Rachael Mayne|n/a|n/a
1-101HPT|16DEC14:00:00:00|01JAN99:00:00:00|Annalise Korsch|Primary Planner|Annalise Korsch|n/a|n/a
run;
data INPUT2 ;
infile cards dsd dlm='|' ;
length ParentPartyId $10
g_snapshot_start_date g_snapshot_end_date 8
;
informat g_snapshot_start_date g_snapshot_end_date datetime.;
format g_snapshot_start_date g_snapshot_end_date datetime.;
input ParentPartyId -- g_snapshot_end_date;
cards;
1-100O7B|13MAR15:00:00:00|15MAR15:23:59:59
1-100O7B|16MAR15:00:00:00|01JAN99:00:00:00
1-101HPT|16DEC14:00:00:00|01JAN99:00:00:00
run;
proc sort data=input1;
by ParentPartyID g_snapshot_start_date g_snapshot_end_date;
run;
proc sort data=input2;
by ParentPartyID g_snapshot_start_date g_snapshot_end_date;
run;
data pt_all_planteam_temp ;
merge INPUT1 (in=in1) INPUT2 (in=in2);
by ParentPartyID g_snapshot_start_date g_snapshot_end_date;
if in1 and in2;
length All_Planners $255;
retain All_Planners;
if first.g_snapshot_end_date then
All_Planners=cat(trim(Plan_mgmt_team_member)," (",trim(Participant_Management_Role_Desc),")")
;
else
All_Planners=catx("; ",trim(All_Planners),cat(trim(Plan_mgmt_team_member)," (",trim(Participant_Management_Role_Desc),")"))
;
if last.g_snapshot_end_date;
keep ParentPartyID g_snapshot_start_date g_snapshot_end_date All_Planners ;
run;
Thanks for the reply, Yet by variable Plan_mgmt_team_member excluded and it has to be in the code. Will check will this approach by including the missing by variable.
Why do you think that variable was excluded?
Did you mean that you want to create more output records based on changes to that variable also? if so then you cannot do the merge and the roll up in one step (without more complicated code) since it is not on the second table you posted.
Hi Tom,
Firstly with Set By A, B variables last.A not necessarily means last.B as well. I have added input.txt file (BY ParentPartyID g_snapshot_start_date) with those variable values. Actually this was my initial problem as well and prevent me from merging in the first instance.
I have added the expected result in the second file called Expected_output.txt where my expected output is the dataset OUTPUT1 in TestCode.sas.
Hope this explains the issue more clearly.
Thanks!
Did you even run the code I posted? It produces the results you requested.
Here is a link that explains how FIRST. and LAST. variables work.
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a000761931.htm
Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.
Register today!Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.