Hi all
I am trying to incorporate PROC HPDS2 into one of our modelling codes to improve performance. We are using a GREENPLUM database. However, before implementing the high-performance procedure in the actual modelling program, I tried running a sample code with some simple arithmetic using PROC HPDS2.
Strangely, the performance of PROC HPDS2 seemed to be poorer than that of a normal data step.
I used the following code:
libname gpdb greenplm server=xxxx db="xxxxx" user="xxxxx" password="xxxxx" schema="xxxxxx";
proc sql noprint;
drop table gpdb.avg_temps;
drop table gpdb.daily_temps;
quit;
options dbidirectexec;
options set=GRIDHOST="xxxxxx ";
options set=GRIDINSTALLLOC="xxxxxxxx";
options set=TKSSH_USER="xxxx";
options set=TKSSH_IDENTITY="xxxxxxx";
data gpdb.daily_temps;
input city $ mon tue wed thu fri;
datalines;
lax 88 92 94 97 86
sfo 65 60 75 72 74
nyc 99 95 94 95 90
phl 92 89 91 93 94
atl 95 99 92 98 94
den 85 87 89 72 73
pit 68 70 72 73 77
rdu 98 95 99 95 96
dtt 88 90 90 87 92
anc 51 56 60 64 62
sea 72 78 77 80 79
msy 98 97 99 98 99
mia 95 92 98 94 96
ord 87 85 84 80 79
dfw 95 96 97 95 97
hou 98 99 98 97 92
las 104 105 102 99 101
pdx 78 82 76 74 80
san 80 81 78 82 80
phx 95 98 95 97 100
cle 75 72 76 80 78
ont 78 80 78 81 72
tpa 94 94 92 90 92
bos 80 78 77 75 79
clt 83 80 79 80 81
;
run;
proc hpds2 data=gpdb.daily_temps
out=gpdb.avg_temps;
performance host="&GRIDHOST" install="&GRIDINSTALLLOC" nodes=all details;
data DS2GTF.out;
dcl double avgf avgc;
dcl char(5) synopsis;
method run();
set DS2GTF.in;
avgf = mean(mon, tue, wed, thu, fri);
avgc = round((avgf - 32.0) * 5.0/9.0, .1);
if avgf >= 95.0 then synopsis = 'Hot';
else if avgf > 80.0 then synopsis = 'Warm';
else if avgf > 60.0 then synopsis = 'Mild';
else synopsis = 'Cold';
end;
enddata;
run;
libname test "/xxxxx/xxxxxx/xxxxx";
data daily_temps;
set gpdb.daily_temps;
run;
data test.test_time;
length synopsis $5;
set daily_temps;
avgf = mean(mon, tue, wed, thu, fri);
avgc = round((avgf - 32.0) * 5.0/9.0, .1);
if avgf >= 95.0 then synopsis = 'Hot';
else if avgf > 80.0 then synopsis = 'Warm';
else if avgf > 60.0 then synopsis = 'Mild';
else synopsis = 'Cold';
run;
While using PROC HPDS2, I kept both the source and destination tables in Greenplum to facilitate in-database processing. Please refer to the performance of HPDS2 below:
NOTE: The HPDS2 procedure is executing in the distributed computing environment with 4 worker nodes.
NOTE: The data set GPDB.AVG_TEMPS has 25 observations and 9 variables.
NOTE: PROCEDURE HPDS2 used (Total process time):
real time 10.75 seconds
cpu time 3.08 seconds
For the same arithmetic, the time taken by a normal data step happens to be:
NOTE: There were 25 observations read from the data set WORK.DAILY_TEMPS.
NOTE: The data set TEST.TEST_TIME has 25 observations and 9 variables.
NOTE: Compressing data set TEST.TEST_TIME increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
I am not sure if I am missing something while doing in-database processing. Can someone help me understand what is going wrong, and why a high-performance procedure that is supposed to be fast is taking so much longer to complete?
Thanks in advance for your valuable suggestions.
With that small an amount of data (25 records) and that response time (10 seconds), the procedure is spending more time setting up the environment for mass processing than doing any actual processing.
There is a high initial overhead that only pays off when far more data, big data, is being processed.
Is it mandatory to have both the source and destination datasets in the Greenplum database to get the best out of this procedure? Or would keeping only the destination in the database be sufficient? I am asking because I may need to copy a big dataset (~7 million records) from the work library to Greenplum before I start using this procedure, if keeping the source in the database is mandatory at all. I expect this step to take a significant amount of time.
If you are trying to do in-database processing then I'd expect that all of the data used by that processing would have to be in the database for best performance.
In fact, how would Greenplum even know about your SAS source data if it wasn't inside the database - via an ODBC source perhaps? ODBC is fine for reading and loading database data, but I would not expect it to perform well for high-performance modelling.
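If you do end up having to push a large WORK dataset into Greenplum first, a minimal sketch might look like the code below. This is only an assumption on my part: it reuses the gpdb libname you defined above, model_input is a hypothetical table name, and the bulkload= data set option assumes your SAS/ACCESS Interface to Greenplum is configured for bulk loading (gpfdist); without that setup, rows are inserted one by one, which will be slow for ~7M records.
/* Hypothetical sketch: copy a large WORK dataset into Greenplum once, */
/* so PROC HPDS2 can then run fully in-database. Assumes the gpdb      */
/* libname defined earlier and a bulk-load (gpfdist) configuration.    */
data gpdb.model_input (bulkload=yes);
set work.model_input;
run;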
Thank you!
I'd also suggest running your HPDS2 outside of Greenplum, just in SAS to see what the performance is like.
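For example, something like the following - this is just your own HPDS2 step with the PERFORMANCE statement and the Greenplum libname taken out, so it runs on the local SAS server against WORK tables:
proc hpds2 data=work.daily_temps out=work.avg_temps;
data DS2GTF.out;
dcl double avgf avgc;
dcl char(5) synopsis;
method run();
set DS2GTF.in;
avgf = mean(mon, tue, wed, thu, fri);
avgc = round((avgf - 32.0) * 5.0/9.0, .1);
if avgf >= 95.0 then synopsis = 'Hot';
else if avgf > 80.0 then synopsis = 'Warm';
else if avgf > 60.0 then synopsis = 'Mild';
else synopsis = 'Cold';
end;
enddata;
run;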
7M records could be big in your current system but still small for a Greenplum environment.
The design problem for analytics is avoiding unnecessary data movement, as that is the real cause of headaches.
If you really do in-database processing, the real big data should be kept in the database; only results or intermediate results should be transferred.
Thanks Jaap. That was helpful. I managed to put everything into the database, and the performance of the modelling code has improved dramatically now. However, further downstream a SAS merge happens with the data, so the data needs to be brought back from Greenplum to work. Is there any way to make a merge (and possibly other SAS syntax) work in-database?
Or any workaround to handle this data movement?
Thanks,
That behavior of pulling data back to SAS is common with a lot of SQL approaches.
With SAS processing you are working through SAS libnames unless you code explicit SQL, but coding explicit SQL loses the power of SAS.
You could upload the SAS data to Greenplum, perhaps into tables of a volatile (temporary) type; that is what I see as an option with Teradata.
Joining tables from different libnames can still be passed to the database when all connection definitions are equal; see SAS/ACCESS(R) 9.2 for Relational Databases: Reference, Fourth Edition (Passing Joins to Greenplum).
When the join is done in-database, the SAS-specific libname options are lost.
You can try to prevent data transfer by setting an option like SQLGENERATION=DBMS (see the SAS(R) 9.4 SQL Procedure User's Guide); note the default of NONE.
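As a rough sketch of that idea (the table names are hypothetical, and whether the join is actually passed down depends on your libname definitions; the SASTRACE options just show in the log what SQL is sent to Greenplum):
/* Hypothetical sketch: rewrite the SAS merge as a PROC SQL join on two */
/* tables that live in the same gpdb libname, so SAS/ACCESS can pass    */
/* the join down to Greenplum instead of pulling both tables into SAS.  */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;
proc sql;
create table gpdb.scored_merged as
select a.*, b.segment
from gpdb.scored_data a
left join gpdb.customer_segments b
on a.customer_id = b.customer_id;
quit;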
Most people think in terms of copying data, so they create new tables. Switch your mindset: you can also create views.
Views can be created:
- in the database (Greenplum). Often this is done by a DBA, but I see no reason an analyst could not do it.
- as a PROC SQL view (SAS server). It will pass settings on to the DBMS; for example, I noticed obs= being translated to sample= with Teradata. I was also thinking of it for returning ordered data, and that works too (point= is still not possible).
- as a data step view (SAS server). It can do all the things a data step can, but lacks some of the interface to the DBMS.
A long multi-step approach can be built this way into a single one, with a smaller amount of intermediate materialized data; see the sketch below.
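As a small illustration (the names are hypothetical), a PROC SQL view over the Greenplum libname is only evaluated when it is read, so no intermediate table has to be materialized:
/* Hypothetical sketch: a PROC SQL view on top of a Greenplum table.    */
/* The selection is evaluated when the view is read, so nothing extra   */
/* is materialized in WORK or in the database.                          */
proc sql;
create view work.v_hot_cities as
select city, avgf, avgc
from gpdb.avg_temps
where synopsis = 'Hot';
quit;
proc print data=work.v_hot_cities;
run;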
Very good reads and tips. Thank you so much.