About DonH

DonH · ‎02-23-2024

Have you enabled the Macro In Operator (MINOPERATOR) option? When SAS initially enabled the IN operator for the Macro language they got feedback from lots of customers that it broke their code (I can't recall the details, but I know that was the reason). So SAS added an option to enable that. Apologies if that was mentioned earlier - based on a quick scan I did not see that mentioned.
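As a minimal sketch of what enabling the option looks like, the hypothetical macro below turns on the IN operator for itself only, using the MINOPERATOR option of the %MACRO statement. The macro name, parameter, and list values are made up for illustration; the default list delimiter (a blank) is assumed.

```sas
/* Hypothetical macro: the IN operator is enabled for this macro only */
%macro checkRegion(region) / minoperator;
  %if %upcase(&region) in EAST WEST %then
    %put Region &region was found.;
  %else
    %put Region &region was not found.;
%mend checkRegion;

%checkRegion(east)
%checkRegion(north)
```

Setting it on the %MACRO statement leaves the session-wide setting alone, which may matter if older code relies on IN being treated as plain text. The system option form (options minoperator;) is the alternative when the operator should be available everywhere.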

DonH · ‎04-30-2023

I have to respectfully disagree with "Since all grouping implies some kind of sorting, you either need a preceding sorting step, or your dataset has to fit into available memory." The hash object does not require the source data to fit into memory. It can (but need not) require the summary table to fit into memory. The book and articles that were noted by @yabwon earlier provide details on memory management issues. Sorting is not required for the hash object solutions described in @yabwon's comments. On your point about grouping by one variable but ordering by another, I assume you are talking about the case where the order variable is not a grouping variable. If you can clarify, I can perhaps provide an example. Regardless, unless the summary table is large, I don't see why a sort after the aggregation is a big deal.
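To illustrate the point, here is a rough sketch (not taken from the thread) that aggregates unsorted input with a hash object and then sorts only the small summary table. SASHELP.CLASS stands in for the real data, and the names s, Count, Height_Sum and work.summary are assumptions made for the example.

```sas
data _null_;
  /* Only the summary rows (one per key value) are held in memory */
  dcl hash s(ordered:"A");
  s.defineKey("Sex");
  s.defineData("Sex","Count","Height_Sum");
  s.defineDone();
  do until (lr);
    set sashelp.class(keep=Sex Height) end=lr;
    /* New key: start from missing; known key: FIND loads the running totals */
    if s.find() ne 0 then call missing(Count, Height_Sum);
    Count = sum(Count, 1);
    Height_Sum = sum(Height_Sum, Height);
    s.replace();
  end;
  s.output(dataset:"work.summary");
  stop;
run;

/* Sorting the summary by a non-key variable is cheap because it is small */
proc sort data=work.summary;
  by descending Height_Sum;
run;
```

The input is read once in whatever order it arrives, so no preceding sort step is needed.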

DonH · ‎04-28-2023

Thanks for the kudos to me and Paul (aka @hashman). I think the choice should depend on the nature of the lookup. Some rough guidelines I have used: Formats are incredibly simple to use if you are only looking up a single value, especially if the relationship between the lookup key and the value to be looked up is relatively static. I've not used ARRAYs much. They have some serious restrictions. The only time I consider them is when the key to the lookup is an integer that can easily be used as the array index AND the number of values is small enough that I can conveniently type them in. If the list is long or comes from a data set, loading an array is harder IMO than using a CNTLIN data set to create a format, or even loading the data set into a hash table. Hash tables are particularly appropriate IMO when you are looking up multiple values or the lookup is a fuzzy one. And I would like to add that a SQL join or a DATA step merge can sometimes be quite an effective solution. Bottom line: there is really no generic answer to this question.
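For the CNTLIN route mentioned above, a small sketch might look like the following. The lookup data, the format name codefmt, and the 'Unknown' default label are all hypothetical and only meant to show the shape of the technique.

```sas
/* Hypothetical lookup table: one row per key/value pair */
data work.lookup;
  input code $ description :$20.;
  datalines;
A Alpha
B Bravo
;
run;

/* Reshape it into the layout PROC FORMAT expects for CNTLIN= */
data work.cntlin;
  set work.lookup end=last;
  retain fmtname 'codefmt' type 'C';
  start = code;
  label = description;
  output;
  if last then do;
    hlo = 'O';          /* catch-all row for keys not in the lookup table */
    label = 'Unknown';
    output;
  end;
  keep fmtname type start label hlo;
run;

proc format cntlin=work.cntlin;
run;

/* Use the format as a single-value lookup */
data _null_;
  do code = 'A', 'Z';
    desc = put(code, $codefmt.);
    put code= desc=;
  end;
run;
```

When more than one value has to come back per key, loading the same data set into a hash table is usually the more natural fit.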

DonH · ‎08-29-2022

Thx for posting this. Am just back from a trip so I will check it out later. In the meantime, DOSUBL came into being because Rick Langston thought it would be a good idea to allow the markup text processed by PROC STREAM to include embedded SAS code. PROC STREAM came into existence when Rick agreed to build a better solution for my DATA step hack for “SAS Server Pages.” He and I had many long conversations about the complications of how to bring results (e.g., macro variables) back from the secondary/side SAS session. And my apologies if any of what I just said is included in Quentin’s presentation.
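As a rough illustration of bringing a result back from the side session, the sketch below runs a PROC SQL step through DOSUBL from inside a DATA step and reads the resulting macro variable with SYMGET. The macro variable name nobs is an assumption, and it is created up front so that it already exists in the calling environment.

```sas
%let nobs=;   /* hypothetical macro variable that receives the result */

data _null_;
  /* The code runs in a side session while this DATA step is still executing */
  rc = dosubl('proc sql noprint;
                 select count(*) into :nobs trimmed from sashelp.class;
               quit;');
  /* Read the value brought back from the side session */
  n = symget('nobs');
  put rc= n=;
run;
```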

DonH · ‎11-07-2021

Lots of great techniques have been discussed here. When Paul and I worked together we would regularly have what can best be described as "yeah, but" conversations. So here is my "yeah, but" point. The issue from my perspective is pretty straightforward: performance depends on a lot of factors, and assuming that the observed results for a given set of data (or combination of data tables) apply across the board is questionable at best. When creating a Cartesian product, the size of the data sets probably matters, both in terms of the number of rows and the number of columns (as well as the total length of the columns). When I did performance evaluations, I tried my best to use data sets that looked like the data for the application at hand. All of the approaches presented here are worthy of evaluation. But they need to be evaluated in the context of the particular sets of data at issue.

DonH · ‎11-06-2021

Correct. And there is also a minimum size for the combination of the key and data portions of the table. IIRC that size also depends on the OS. I don’t remember the details, but I’m sure @hashman does 😀.
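One way to see the per-item size on a given system is the ITEM_SIZE attribute of the hash object. The tiny sketch below uses hypothetical variables k and d; the number it prints is whatever your OS and SAS release report, so no value is shown here.

```sas
data _null_;
  length k 8 d $ 1;
  dcl hash h();
  h.defineKey("k");
  h.defineData("d");
  h.defineDone();
  call missing(k, d);
  /* ITEM_SIZE reports the bytes used per item in this hash object */
  size = h.item_size;
  put "Bytes per hash item: " size;
run;
```

Comparing the result for a very short key and data portion against a longer one gives a feel for where the minimum kicks in.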

DonH · ‎08-17-2021

It is no longer available/supported. See https://www.sas.com/en_us/software/on-demand-for-academics.html.

DonH · ‎04-19-2021

Paper 1032-2021

Authors: Paul Dorfman, Don Henderson

Abstract

Aggregating or combining large data volumes can challenge computing resources. For example, the process may be hindered by the system limits on utility space or memory and, as a result, either fail or run too long to be useful. It is a natural inclination to try solving the problem by segregating the input records into a number of smaller segments, processing them independently and combining the results. However, in order for such a divide-and-conquer tactic to work, two seemingly contradictory criteria must be met: First, to aggregate or combine the data correctly, no segment can share its key values with the rest; and second, the segments must be more or less equal in size. In this presentation, we show how a hash function can be used to achieve it for arbitrary input with no prior knowledge of the distribution of the key values among its records. Effectively, the method renders any task of aggregating or combining data of any size doable by splitting its input into a large enough number of segments. Such an approach can be used to process the segments sequentially or in parallel. The trade-off is the need to partially re-read the data. However, it is a rather small price to pay for making a failing or endlessly running task finish on time.

Watch the presentation

Watch the authors present this topic -- Uniform Hashing of Arbitrary Input Into Key-Exclusive Segments -- on the SAS Users YouTube channel.
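The segmentation idea in the abstract can be sketched roughly as follows. This is an illustration under assumptions, not necessarily the exact method from the paper: it uses the MD5 function as the hash function, SASHELP.CLASS as stand-in data, Name as the key, and a hypothetical macro variable nSegments for the number of segments.

```sas
%let nSegments = 4;   /* hypothetical number of segments */

data work.segmented;
  set sashelp.class;
  /* Hash the key, read the first 4 bytes of the digest as an unsigned
     integer, and map it to a segment number between 1 and &nSegments.
     Equal key values always land in the same segment. */
  segment = 1 + mod(input(md5(strip(Name)), pib4.), &nSegments);
run;

/* Check how evenly the rows were spread across the segments */
proc freq data=work.segmented;
  tables segment;
run;
```

In practice the segment expression would typically go into a WHERE clause or subsetting IF so that each pass reads only its own segment, rather than writing a segmented copy of the data.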

DonH · ‎03-18-2021

First, thanks @PeterClemmensen for the reference to the book. It is not clear what you are asking, @Emjay, since you reference two code snippets without a specific question. I suspect your issue is that you don't recognize that the hash object is actually a data table that you can create and access inside a DATA step. So what we have here is three distinct tables of data being accessed inside the DATA step. The hash object h is created from your data set and holds all the distinct values of account_old. The hash object hh is the list of all account by account_old combinations, with account_old renamed to account_oldest. The third is your input data set have, which is read by the SET statement. The SET statement loops through each observation in the have data set, one observation per execution of the DATA step. It then subsets the data (step 4 in the code) to only continue processing if the account number is not found in the hash table h. The CHECK method returns 0 if the key is found and a non-zero value if not. So this is subsetting the data to just those accounts that were never a pre-account. Next, the DO UNTIL construct is simply used to find the first row in the hash table hh where the value of account on the current row equals the account_old value in hh. Since your data are sorted so the oldest pre-account is listed first, as soon as the FIND method finds it, the value is copied and the loop ends.
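The return-code behavior of the CHECK method described above can be seen in isolation with a small sketch. The account values here are made up, and the hash table is loaded by hand rather than from the data set in the thread.

```sas
data _null_;
  length account_old $ 8;
  dcl hash h();
  h.defineKey("account_old");
  h.defineDone();

  /* Load two hypothetical old account numbers into the hash table */
  do account_old = "A1", "A2";
    rc = h.add();
  end;

  /* CHECK returns 0 when the key is in the table, non-zero otherwise */
  do account = "A2", "A9";
    rc = h.check(key: account);
    put account= rc=;
  end;
run;
```

A subsetting statement such as if h.check(key: account) ne 0; therefore keeps only the rows whose account value never appears as an account_old value.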

DonH · ‎09-18-2020

Generating data has a number of use cases, for example generating test cases, generating volume data for performance testing, and so on. For our book Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study (by @hashman and @DonH), we needed to generate the sample data. Choosing sample data can be challenging. If you use data that is industry or subject matter dependent, users in other industries have trouble relating to it (or occasionally dismiss it out of hand). For that reason we decided to use sports-related data and chose baseball, in part because @DonH is a baseball geek. There is lots of data collected about baseball games, and baseball fans are very focused on the analytics of baseball (referred to as sabermetrics). We were unable to use the XML data for Major League Baseball, so we decided to generate data for a complete season of a game we came to call Bizarro Ball. Bizarro Ball is similar to baseball, but it has some bizarre rules that are different, thus the name. We used the hash object in many of the programs to generate the data. During the technical review of the book, we got feedback that describing how we generated the data was interesting, but did not seem to fit the Data Management and Business Intelligence theme of the book. So we decided not to include those details in the book and to document them externally instead. Given that generating data is of broader interest than just what we needed to do for our sample data, we decided that the series of articles we had planned to write might be of interest to SAS users other than those who are interested in the book and want to generate different sample data. This article will be updated as we write the additional articles that talk about our general approaches (use of random numbers, random selection, parameter files, what to parameterize vs. what to hard-code, and so on).
So please follow this article if you are interested in being notified about the follow-on articles that address these topics in more detail.

DonH · ‎09-18-2020

Debugging logic errors in SAS DATA step programs can be challenging. However, there are lots of well-known and established techniques, including, but not limited to, using the PUT statement to display the values of key variables in the SAS log and using the SAS debugger. Unfortunately, the contents of a SAS hash object (aka a hash table) cannot be displayed in the log using a PUT statement; likewise, the debugger does not recognize the SAS hash object. The EX(AMINE) debugger command displays this message when used with a SAS hash object pointer:

    Cannot print object type

As Paul (aka @hashman) and I (aka @DonH) worked on our SAS Press book Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study, we used some tips and tricks to facilitate the debugging of the examples in the book. We would like to share a couple of those in this article.

So let us first create an example program that is not producing the expected results. As an aside, it is not an easy task to create a program that is wrong on purpose in a subtle way. So you will have to bear with our example program whose error may be obvious to you.

We want to use the hash object to create aggregates using one or more class variables and several analysis variables. The wrinkle is that the data are not sorted by the class variable. And yes, we know that there are lots of SAS procedures that do this out of the box (but see the immediately preceding comment about creating programs that have non-obvious bugs). Specifically, we want to summarize the SASHELP.CLASS data set to create an output data set with the sum of the height and weight variables for each value of the variable Sex.
The following program is our first attempt to do this:

    data _null_;
      if _n_ = 1 then do;
        dcl hash genderSum(ordered:"A");
        genderSum.defineKey("Sex");
        genderSum.defineData("Sex","Count","Height_Sum","Weight_Sum");
        genderSum.defineDone();
      end;
      do until(last.sex);
        set sashelp.class end=lr;
        by sex notsorted;
        if first.sex then call missing(Count,Height_Sum,Weight_Sum);
        Count + 1;
        Height_Sum + Height;
        Weight_Sum + Weight;
      end;
      genderSum.replace();
      if lr then genderSum.output(dataset:"Sums");
    run;

We define the hash object on the first execution of the DATA step and then use a DoW loop to read each group of data with the same value for the variable Sex. As is typical for FIRST. and LAST. processing, we make sure to initialize our sum variables to missing when encountering a new value for the variable Sex and use SUM statements to create the needed aggregates. At the end of each BY group we use the REPLACE method to update our hash table. The REPLACE method will add a row if the key (in this case Sex) is not found; otherwise it will update the current row with the values of the PDV host variables.

Upon running this program we can see that the results seem to be wrong.

What we would like to do in order to debug this is to examine the contents of the hash table pointed to by genderSum after each BY group. The question is how? One approach is to use the OUTPUT method with a unique name for each data set. Since the DoW loop executes once for each BY group, the variable _N_ (the DATA step execution counter) provides a unique value for each group. So we add an OUTPUT method call after the call to the REPLACE method. Note that the name of the output data set is defined as a character expression, thus allowing us to create separate data sets:

    genderSum.replace();
    genderSum.output(dataset:cats("Sums",_n_));
    if lr then genderSum.output(dataset:"Sums");

This will cause data sets Sums1, Sums2, . . . to be created. Upon reviewing this output, the first sign of trouble is the Sums4 data set.
The value of Count should be increasing. The last output data set (obviously) matches our original output. Our next step is to examine the source data and discover that the last F group has 4 rows and the last M group has 5. Thus the problem is an issue with initialization.

Upon reviewing our logic we recognize that resetting the totals at FIRST.SEX does not work as expected. Upon encountering a group already in the hash table, we want the FIND method to load the current cumulative values into the PDV host variables. The issue is that since the data are grouped, but not sorted, the FIRST. check is resetting the values to missing with each new group. We only want to initialize the values to missing the first time each group (i.e., each value of Sex) is encountered. And if we are reading data for a value of Sex that is already in the hash table, we need to copy the values from the hash table to the PDV host variables Count, Height_Sum and Weight_Sum. So we replace the IF statement that invokes the CALL MISSING routine as follows:

    if first.sex and genderSum.find() ne 0 then call missing(Count,Height_Sum,Weight_Sum);

We immediately discover that the data are still off.

So at this point we decide to use the debugger to check the values of the PDV host variables and discover that, due to the ordering of the FIND, REPLACE and OUTPUT methods, we are still overwriting the values for Count, Height_Sum and Weight_Sum. The issue is that genderSum.find() is executed as each observation is read. So our sum variables do not retain the values from the previous row if it had the same value for the variable Sex. We need to initialize the sum variables to missing the first time each value of Sex is encountered, but we don't want the FIND method to overwrite the sum PDV host variables on other rows. That is where the CHECK method comes into play.
Use the CHECK method to determine if the values should be set to missing because we are encountering a new value (i.e., one not in the hash table) of our class variable Sex, and use the FIND method otherwise to load the current cumulative values into the PDV host variables. Our final (and correct) program follows:

    data _null_;
      if _n_ = 1 then do;
        dcl hash genderSum(ordered:"A");
        genderSum.defineKey("Sex");
        genderSum.defineData("Sex","Count","Height_Sum","Weight_Sum");
        genderSum.defineDone();
      end;
      do until(last.sex);
        set sashelp.class end=lr;
        by sex notsorted;
        /* only at the start of a BY group: new key -> reset, known key -> reload totals */
        if first.sex then
          if genderSum.check() ne 0 then call missing(Count,Height_Sum,Weight_Sum);
          else genderSum.find();
        Count + 1;
        Height_Sum + Height;
        Weight_Sum + Weight;
      end;
      genderSum.replace();
      if lr then genderSum.output(dataset:"Sums");
    run;

Before closing, one last tip on using the OUTPUT method when debugging. Recall the code where we used a distinct name for the output data set so we could see the contents of the hash object after each group of data was processed:

    genderSum.replace();
    genderSum.output(dataset:cats("Sums",_n_));
    if lr then genderSum.output(dataset:"Sums");

Instead of a unique name, you can use the OUTPUT method with a fixed name and use the debugger to pause after each group of data is processed. Since there is one execution of the DATA step per BY group, you can WATCH _N_, and when the debugger causes the program to pause, just double-click on the data set name in the Explorer window to see the contents.

    genderSum.replace();
    genderSum.output(dataset:"Interim");
    if lr then genderSum.output(dataset:"Sums");

This screenshot of the Log, Debugger and Explorer Window highlights how the Interim data set is output after each BY group.

DonH · ‎09-14-2020

Nope. The code I posted used a WHERE clause and I explicitly said the DATA step was simply a surrogate for the code to process that data - the implication being that the WHERE clause should be used in either the hash or your proposed SQL approach. And that point was reinforced by the inclusion of a sample PROC APPEND step to create the cumulative result set. I was not suggesting that a local copy be made.

DonH · ‎09-14-2020

Regardless of whether you adopt ChrisNZ's approach using SQL or hash tables, permit me to suggest you use a macro loop to partition the data. A simple technique that partitions the data, and is easily updated to change the size of the subset, is to pick a digit from the account number to filter on. Typically that results in each subset being about 10% of the data. If that subset is too large, use two digits, giving you approximately a 1% subset; three digits give a 0.1% subset; and so on. For example:

    %macro LoopThruSubsets(digits=1,stopAt=);
      %local i;
      proc delete data=cumulativeResults; /* clear out before starting */
      run;
      %do i = 0 %to %sysfunc(coalescec(&stopAt,10**&digits-1));
        %put loop for last &digits digits = %sysfunc(putn(&i,z&digits.));
        data subset;
          set have;
          where mod(acct,10**&digits)=&i; /* rows whose last &digits digits equal &i */
        run;
        proc append base=cumulativeResults data=subset;
        run;
      %end;
    %mend;

Note that I used a simple DATA step (edited to further clarify that the intent was not to create a local subset for processing) that does nothing but subset the data, to highlight the approach. You can then call the macro to process the data in 10% subsets like this:

    %LoopThruSubsets(digits=1)

And in 1% subsets:

    %LoopThruSubsets(digits=2)

And if you want to test using just the first 1% subset, call the macro this way:

    %LoopThruSubsets(digits=2,stopAt=0)

DonH · ‎09-12-2020

Mark makes a good point about this being a problem that can be done with the Hash of Hash approach. However, that approach requires that you can fit all the transactions for all the customers into memory. If you can't, you will have to do some sort of looping. Given that you have to implement looping through customers (I'm assuming that each customer is processed independently), the benefit of the Hash of Hash approach (a separate hash object for each customer) may not be worth it (unless you also have lots of customers). There are multiple ways to do this looping. Before I suggest anything, can you provide any details about how many customers you could have in your input file and a rough guesstimate as to the maximum number of transactions a customer may have?

DonH · ‎09-11-2020

I believe the hash object is a good fit. You might want to check out the section on stacks: a separate stack for each day, with output (as appropriate) and clearing as you complete each day. I do have a couple of questions about your requirements:

3) Count those transactions that qualify
4) Sum those transaction amounts that qualify
5) Divide to find average

I am not sure I know what you mean by "transactions that qualify" as well as what to do with the results.
