@Tom:
Agreed, in more than one sense (see below). I answered the OP's question about clearing the lags since that is what the OP had asked for; and I don't think that for problems of this kind using the LAGn function, let alone using multiple queues, is called for.
As to your code, note that though it doesn't use the LAGn function explicitly, it essentially emulates the working of LAG6 by moving all the elements of an artificial queue LAG1-LAG6 up by one for every record - which is what LAG6 does behind the scenes (only about an order of magnitude more efficiently - I've tested). The advantage of using your artificial queue is that all of its items are accessible in the PDV, while with LAG6 the only accessible item is the one at the head of the queue. (That must be why the OP deemed the use of all the functions LAG1 through LAG6 necessary.)
Having said that, your code involves a lot of hard coding - the more, the wider the rolling sum window. Hence, the same concept can be expressed more tersely using an array, e.g. (I've set W=3 to allow at least one BY group to be larger than W - otherwise we don't have a good test case):
data have ;
  input SN $ Name $ Date Count ;
  informat date anydtdte. ;
  format date yymm7. ;
  cards ;
11075652 NameA 03/2019 12
11075652 NameA 04/2019 4
11075652 NameA 05/2019 3
11075652 NameA 06/2019 1
11075652 NameB 05/2019 1
11075682 NameA 07/2019 1
11075682 NameC 05/2018 2
11075682 NameC 06/2018 2
11075682 NameC 07/2018 2
11075682 NameC 08/2018 0
11075682 NameC 09/2018 2
;
run ;
%let w = 3 ;
data want ;
  do until (last.name) ;
    set have ;
    by sn name ;
    array cl count lag1-lag&w ;
    cum_sum = sum (of cl[*]) ;
    output ;
    do _i_ = dim (cl) to 2 by -1 ;
      cl [_i_] = cl [_i_-1] ;
    end ;
  end ;
run ;
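For the curious, the queue-shifting logic above is not SAS-specific. Here is a minimal Python sketch of the same idea - keep the current value plus the last W lags in a small list, sum what's there, then shift. The function name, W=3, and the sample counts are my own illustrative assumptions, not part of the original solution:

```python
W = 3

def rolling_sums(counts, w=W):
    """Rolling sum over the current value and up to w lags,
    with the lag queue cleared at the start (one call per BY group)."""
    lags = [None] * w            # lag1..lagW, initially "missing"
    out = []
    for c in counts:
        window = [c] + lags                         # COUNT, LAG1..LAGw
        out.append(sum(v for v in window if v is not None))
        lags = [c] + lags[:-1]   # shift the queue down by one slot
    return out

# NameC group from the sample data: counts 2, 2, 2, 0, 2
print(rolling_sums([2, 2, 2, 0, 2]))  # [2, 4, 6, 6, 6]
```

Note that, like the SAS array version, each record costs O(W) work - which is exactly the inefficiency discussed next.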
Now, as I've repeatedly stated in the past (including in a similar thread, methinks originated by the same OP in the same vein), the whole idea of computing a rolling sum in a window of size W by calculating all the lags from 1 to W and summing them up is totally misguided. The reason is that to compute a rolling sum, we only need to add the leading item and subtract the item W+1 records back in the file. This way, computing a rolling sum for W=3 and for W=100000 is equally efficient, while summing up 3 lag items vs 100000 lag items - not to mention getting them all in the first place - is anything but. In this case, the leading item is the current COUNT, so all we need is the value of COUNT W+1 records back in the file - and that only when the size of the BY group exceeds W+1. That item can be fetched using the LAGw function, e.g.:
data want ;
  do until (last.name) ;
    set have ;
    by sn name ;
    link lag ;
    cum_sum = sum (cum_sum, count, - _sub) ;
    output ;
  end ;
  count = 0 ;
  do _n_ = 1 to &w + 1 ;
    link lag ;
  end ;
  return ;
  lag: _sub = sum (lag%eval(&w+1)(count), 0) ;
  return ;
run ;
However, again, using the LAGw function here, even just a single one, is far from optimal, since its internal queue needs to be cleared before/after each BY group, which for large W means a lot of overhead. It's much simpler and less onerous to fetch the item W+1 records back by using a direct-access SET, e.g.:
data want (drop = _: p q) ;
  do _n_ = 1 by 1 until (last.name) ;
    set have curobs = q ;
    by sn name ;
    _c = . ; * clear _C so a value fetched in a prior BY group cannot leak in ;
    if _n_ > &w + 1 then do ;
      p = q - (&w + 1) ;
      set have (keep=count rename=count=_c) point = p ;
    end ;
    cum_sum = sum (cum_sum, count, - sum (_c, 0)) ;
    output ;
  end ;
run ;
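The add-one/subtract-one arithmetic is the same in any language. A hedged Python sketch of the direct-access idea - index the value W+1 positions back instead of dragging a queue along - with illustrative names and data of my own:

```python
W = 3

def rolling_sums_o1(counts, w=W):
    """O(1) work per record: add the incoming value, subtract the
    value that has just fallen out of the (w+1)-wide window."""
    total, out = 0, []
    for i, c in enumerate(counts):
        total += c
        if i >= w + 1:                    # window is full; evict oldest
            total -= counts[i - (w + 1)]  # direct access, like POINT=
        out.append(total)
    return out

print(rolling_sums_o1([2, 2, 2, 0, 2]))  # [2, 4, 6, 6, 6]
```

The per-record cost here is independent of W, which is the whole point of the argument above.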
In principle, there can be situations where we need all the lag values (though I can hardly fancy what for - surely not for computing a rolling sum, as shown above). In this case, using multiple LAG function queues is utterly misguided, too, because a simple array can do the job much more efficiently:
%let w = 3 ;
proc sql noprint ;
  select max (q) into :q
  from (select count (*) as q from have group by sn, name) ;
quit ;
data want ;
  array q [-&w:&q] _temporary_ ;
  do _n_ = 1 by 1 until (last.name) ;
    set have ;
    by sn name ;
    q[_n_] = count ;
    array ll lag1-lag&w ;
    do over ll ;
      ll = q[_n_-_i_] ;
    end ;
    output ;
  end ;
run ;
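The same store-everything-and-index idea can be sketched in Python, one call per BY group (names and data are illustrative assumptions): positions before the start of the group simply yield a missing value, just as the never-written negative array slots do above.

```python
W = 3

def lags_from_list(counts, w=W):
    """For record n, read positions n-1 .. n-w out of the stored
    values; out-of-range positions come back as None (missing)."""
    out = []
    for n in range(len(counts)):
        row = [counts[n - i] if n - i >= 0 else None
               for i in range(1, w + 1)]
        out.append(row)              # [lag1, lag2, ..., lagW]
    return out

# NameA group from the sample data: counts 12, 4, 3, 1
print(lags_from_list([12, 4, 3, 1]))
# [[None, None, None], [12, None, None], [4, 12, None], [3, 4, 12]]
```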
Note that by the nature of the algorithm, the array doesn't need to be reinitialized before each BY group, because every new BY group just overwrites the requisite number of array items. However, the need to pre-process the input file to size up the array is a bit off-putting. That can be fixed by merely setting the upper bound to something like 1000000 - but that means assuming that no BY group is larger. A better solution is to use the hash object, as it requires no assumptions about the data as long as memory is plentiful enough for the largest BY group (though the code gets a little more verbose than with the array):
%let w = 3 ;
data want (drop = _:) ;
  if _n_ = 1 then do ;
    dcl hash h () ;
    h.definekey ("_n_") ;
    h.definedata ("_c") ;
    h.definedone () ;
  end ;
  do _n_ = 1 by 1 until (last.name) ;
    set have (rename=count=_c) ;
    by sn name ;
    count = _c ;
    h.replace() ;
    array v lag1-lag&w ;
    do over v ;
      _c = . ;
      _iorc_ = h.find (key:_n_-_i_) ;
      v = _c ;
    end ;
    output ;
  end ;
run ;
Here again, because the algorithm essentially emulates the array approach by calling the REPLACE method rather than ADD, there's no need to clear the hash for each BY group. If ADD were used, the CLEAR method would have to be called before (or after) the DOW loop. One technical subtlety is that when (_n_ - _i_) < 1, the FIND method fails and _C remains missing, which is exactly what we want. A little extra efficiency can be gained by calling FIND only when (_n_ - _i_) > 0.
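In Python terms, the hash object plays the role of a dictionary keyed by record position; here is a minimal sketch of the same REPLACE/FIND pattern (illustrative names and data of my own, not the SAS code translated):

```python
W = 3

def lags_via_dict(counts, w=W):
    """Store each value under its within-group position (REPLACE),
    then look lags up by key (FIND); .get() returns None (missing)
    when the lag reaches before the start of the group."""
    h, out = {}, []
    for n, c in enumerate(counts, start=1):
        h[n] = c                     # REPLACE: overwrite key n per group
        out.append([h.get(n - i) for i in range(1, w + 1)])
    return out

print(lags_via_dict([12, 4, 3, 1]))
# [[None, None, None], [12, None, None], [4, 12, None], [3, 4, 12]]
```

As with the SAS hash, the dictionary never needs clearing between groups, because each group overwrites the keys it reads.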
Kind regards
Paul D.