Hello, I have a rather large join and aggregation that is part of a higher-level program. The join and aggregation is used many times in that program with tables that are structurally identical but contain different data, so it would be nice to speed this step up.
/*
We have one large table A (~550 million rows) and one smaller table B (~1.2 million rows).
Table A is sorted by I, ID, ID_2; table B is sorted by ID, ID_2. Due to the way the tables are created, B contains only ID and ID_2 values that occur in A.
Table A, however, might not contain every ID/ID_2 entry of table B. This should be irrelevant for the question and is therefore not modeled here.
*/
/***************************************************************************************************/
/* Create sample data: */
%let max_i = 5600;
%let max_id = 55000;
%let max_category = 22;
data _ids;
    do ID = 1 to &max_id.;
        ID_2 = "1";
        output;
        /* Some IDs occur with both ID_2 = "1" and ID_2 = "0" */
        if rand('uniform',0,1) < 0.01 then do; ID_2 = "0"; output; end;
    end;
run;
proc sort data=_ids; by ID ID_2; run;
data _i;
    do I = 1 to &max_i.;
        output;
        /* Not every I is present. */
        /* Not modeled here, but possibly important: we don't know max_i,
           and max_i differs between the various tables A. */
        if rand('uniform',0,1) < 0.01 then I + 1;
    end;
run;
/* Cartesian product for table A */
proc sql;
    create table table_A as
    select A.I
          ,B.ID
          ,B.ID_2
          ,ceil(rand('uniform',0,&max_category.)) as CATEGORY
    from _i A
    left join _ids B
        on 1 = 1
    order by A.I, B.ID, B.ID_2;
quit;
/* Create table B */
data table_B;
    set _ids;
    do CATEGORY = 1 to &max_category.;
        X = rand('uniform',0,1);
        output;
    end;
run;
/***************************************************************************************************/
/* Join and Aggregation */
/* Option A: join and group by */
proc sql;
    create table WANT_OPTION_A as
    select A.I
          ,sum(B.X) as SUM_X
    from table_A(sortedby=I ID ID_2) A
    inner join table_B(sortedby=ID ID_2 CATEGORY) B
        on  A.ID       = B.ID
        and A.ID_2     = B.ID_2
        and A.CATEGORY = B.CATEGORY
    group by A.I;
quit;
/* Option B: hash lookup */
data WANT_OPTION_B(keep=I SUM_X);
    /* Pull table_B's variables into the PDV without reading a row */
    if 0 then set table_B;
    if _N_ = 1 then do;
        declare hash HH_BEW (dataset: 'table_B');
        HH_BEW.defineKey ('ID', 'ID_2', 'CATEGORY');
        HH_BEW.defineData ('X');
        HH_BEW.defineDone ();
    end;
    set table_A(sortedby=I ID ID_2);
    by I;
    call missing (X);
    retain SUM_X;
    if first.I then SUM_X = 0;
    /* X stays missing when the key is not found; sum() ignores missing */
    RC = HH_BEW.find ();
    SUM_X = sum (SUM_X, X);
    if last.I;
run;
Unfortunately I cannot change underlying settings like memsize. However, I have some control over the size of table A in the higher-level program without losing significant time there, i.e. I can split table A into several tables (e.g. two with 330 million rows each) or aggregate tables (e.g. two with 660 million rows each into one with 1320 million rows).

Test results:

For a table A with 330 million rows:
- Option A: ~4 minutes user time and ~7 minutes CPU time
- Option B: ~4 minutes user time and ~4 minutes CPU time

For a table A with 660 million rows:
- Option A: ~8 minutes user time and ~14 minutes CPU time
- Option B: ~8 minutes user time and ~8 minutes CPU time

Since both options scale roughly linearly with the size of table A, there seems to be no need to change that size. Option B is better, but only regarding CPU time.

I thought about combining ID, ID_2 and CATEGORY into a single key and using a format on table A (a rough sketch of what I mean is at the end of this post). This would get rid of the join or hash. However, I still need the full table A with the CATEGORY information for another join, so there is no way to save time by creating only the aggregate instead of table A. I have never used a format for this before, hence it would be a small challenge. Would this still be worth a try? Or is there any other way to speed something up?

Thank you in advance.
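PS: To make the format idea concrete, here is a minimal, untested sketch of what I have in mind. Note that it builds an informat rather than a format, because the lookup has to return the numeric X; the name XKEY, the '|' separator, and the $40 key length are just placeholders I made up for illustration.

/* Build a numeric informat from table_B mapping "ID|ID_2|CATEGORY" to X */
data cntlin;
    set table_B end=last;
    retain FMTNAME 'XKEY' TYPE 'I';
    length START $ 40 LABEL $ 32;
    START = catx('|', ID, ID_2, CATEGORY);
    LABEL = left(put(X, best32.));
    output;
    if last then do;
        /* catch-all: unmatched keys return missing, like option B's find() */
        HLO   = 'O';
        LABEL = '.';
        output;
    end;
run;
proc format cntlin=cntlin;
run;
/* One sequential pass over table A: no join, no hash object */
data WANT_OPTION_C(keep=I SUM_X);
    set table_A(sortedby=I ID ID_2);
    by I;
    retain SUM_X;
    if first.I then SUM_X = 0;
    /* sum() ignores the missing value returned for unmatched keys */
    SUM_X = sum(SUM_X, input(catx('|', ID, ID_2, CATEGORY), XKEY.));
    if last.I;
run;

The open question for me is whether this informat lookup is actually faster than HH_BEW.find().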