About hashman

hashman · ‎05-12-2020

Hi @novinosrin: do _n_ = h.clear() by 0, eh? I kind of recognize the style - David Cassell calls it a "dorfmanism" - but it's never occurred to Dorfman himself to use it in this manner. Kudos! Kind regards Paul D.

hashman · ‎04-08-2020

@whymath: At the risk of self-aggrandizing, I'd suggest that you read one of my APP papers, for instance (not sure it's the latest): https://support.sas.com/resources/papers/proceedings14/1510-2014.pdf There are many examples in the paper - and, perhaps more importantly, a decent amount of theory. A good theory, as physicists say, is the most practical thing. Kind regards Paul D.

hashman · ‎03-17-2020

@left: Just add _n_alive to the data portion of OUT and to CALL MISSING, and also add:

_n_alive = sum (_n_alive, Status =: "A") ;

in the same vein as _n and _sum.

Kind regards Paul D.
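For readers without the original thread at hand, the counting idiom can be sketched in isolation. The data set HAVE, its values and the plain BY-group loop below are hypothetical stand-ins (the original program keeps its counters in the hash table OUT, which is not reproduced here). The point is only that the Boolean expression Status =: "A" evaluates to 1 or 0, so adding it up with SUM counts the records whose Status begins with "A":

```sas
* Hypothetical sample data - not from the original thread ;
data have ;
  input id Status $ ;
  cards ;
1 Alive
1 Dead
1 Alive
2 Dead
;

data want (keep = id _n_alive) ;
  do until (last.id) ;
    set have ;
    by id ;
    * the comparison yields 1 or 0, and the =: operator compares by prefix ;
    _n_alive = sum (_n_alive, Status =: "A") ;
  end ;
run ;
```

Since _n_alive is not retained, it starts out missing for every BY group, and SUM ignores the missing value on the first addition.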

hashman · ‎01-30-2020

@mkeintz: Mark, whenever I had to resort to this SLEEP+Windows Task Manager subterfuge, it always made me internally swear at not having a SAS function that would return the amount of memory being currently used by the step or at least the SAS session. Methinks it wouldn't be too hard to implement in the underlying software. Kind regards Paul D.

hashman · ‎01-30-2020

@yabwon: With your fabulous errata sheet taken into account! Kind regards Paul D.

hashman · ‎01-30-2020

@PeterClemmensen: Thanks for drawing attention to this angle. @DonH and I had discussed whether to include MEMRC in the book (it was in "my" chapter and I originally did) but decided against it when weighing how to reduce the over-the-limit page count. This argument tag gives the programmer the option to continue processing if a hash table is filled beyond the system memory limit rather than abending the step - and the program. Conceivably, it can be used to decide programmatically to resort to a different method of processing should the hash step overfill the memory. Sort of like "if the hash step should overfill the memory, detect the condition and use a different piece of code - e.g. some divide-and-conquer methodology - to process the data in a different manner to attain the same goal".

One problem with this approach is that under certain scenarios, the hash step can run for hours before running out of hash memory - as @DonH and I have found the hard way in the real data processing world while (ab)using the hash object for data aggregation. When one aggregate was done with, the table would be voided via CLEAR and the step would proceed to the next aggregate, and it could turn out that the aggregate that would overfill the table beyond the memory limits would be one of the last, after hours of the job running. However, methinks it's better to have MEMRC just in case rather than not having it at all, however rare and exotic its usage might be.

Kind regards Paul D.
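The "detect the condition and use a different piece of code" idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the book or the thread: SASHELP.CLASS stands in for a real (large) input, the macro variable HASH_OK and the macro PROCESS are hypothetical names, and it assumes, as I read the documentation, that with MEMRC:"y" a memory failure while loading via the DATASET argument tag comes back as a non-zero code from DEFINEDONE, after which the table can only be deleted:

```sas
data _null_ ;
  if 0 then set sashelp.class (keep = name age) ; * host variables for the hash ;
  dcl hash h (dataset:"sashelp.class") ;
  h.definekey ("name") ;
  h.definedata ("name", "age") ;
  * assumption: with MEMRC a memory failure during the load is returned as a code ;
  rc = h.definedone (memrc:"y") ;
  if rc ne 0 then h.delete() ;
  call symputx ("hash_ok", rc = 0) ; * hypothetical flag for the code that follows ;
  stop ;
run ;

%macro process ; /* hypothetical dispatcher */
  %if &hash_ok %then %put NOTE: Hash table loaded - proceed with the hash step. ;
  %else %put NOTE: Hash load failed - switch to a divide-and-conquer method. ;
%mend ;
%process
```

In a real job the two %PUT statements would be replaced by calls to the hash-based step and to its fallback, respectively.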

hashman · ‎01-29-2020

@tennis1: With the desired output provided, it's better. But your counts don't add up. The way I count manually - and both programs below count - the output expected from your input should be:

ID  total_overlap_days
-----------------------
 1   9
 2   8
 3  27
 4  28
 5  28

Apropos, I find it unnecessary and development-hindering to provide sample input of the kind you've provided, particularly in terms of the dates presented in the diabolical MM/DD/YY format. Something like this:

data have ;
  input id drug $ start_dt end_dt ;
  cards ;
1 A  1 10
1 B  7 12
1 A 14 17
2 A  1 20
2 B 10 16
2 B 19 23
;

is way better because with it I can count the overlaps per ID manually in a minute, plus play with the input data and evaluate the expected output much more nimbly. At any rate, if the total overlap count is all you want, you can simply do:

data have ;
  input id drug $ (start_dt end_dt) (:mmddyy8.) ;
  cards ;
1 A 10/14/19 11/14/19
1 B 11/06/19 12/06/19
1 A 12/09/19 01/09/20
2 A 10/01/19 11/01/19
2 B 10/25/19 11/25/19
2 B 12/01/19 12/31/19
3 A 10/06/19 11/06/19
3 B 10/01/19 11/01/19
4 A 11/01/19 11/30/19
4 B 11/03/19 12/03/19
4 B 12/15/19 01/15/20
5 B 10/05/19 11/05/19
5 A 10/01/19 11/01/19
;

data v / view = v ;
  set have ;
  do date = start_dt to end_dt ;
    output ;
  end ;
run ;

proc sql ;
  create table want_sql as
  select id, count (q) as Total_Overlap_Count
  from (select id, date, count (distinct drug) as q
        from v
        group 1, 2
        having q > 1)
  group id
  ;
quit ;

However, it may be more efficient to do it this way because the aggregation is done on the fly:

data want_hash (keep = id total_overlap_days) ;
  if _n_ = 1 then do ;
    dcl hash h (ordered:"a") ;
    h.definekey ("date") ;
    h.definedata ("drug") ;
    h.definedone () ;
  end ;
  do until (last.id) ;
    set have ;
    by id ;
    _drug = drug ;
    do date = start_dt to end_dt ;
      if h.find() ne 0 then do ;
        drug = _drug ; * a successful FIND on an earlier date overwrites DRUG ;
        h.add() ;
      end ;
      else if drug ne _drug then Total_Overlap_Days = sum (Total_Overlap_Days, 1) ;
    end ;
  end ;
  h.clear() ;
run ;

Kind regards Paul D.

hashman · ‎01-29-2020

@MarkusB: Try this:

data have ;
  input M Q ;
  cards ;
1 10
2 30
;

data want ;
  set have ;
  array qm qm_1 - qm_2 ;
  do over qm ;
    qm = Q * (_i_ = M) ;
  end ;
run ;

Or, if you don't mind having missing values instead of zeroes, it's even simpler:

data want ;
  set have ;
  array qm_ [2] ;
  qm_[M] = Q ;
run ;

Kind regards Paul D.

hashman · ‎01-28-2020

@tennis1: The logically simplest way is to "paintbrush" the dates into an array or a hash table; then you need only simple by-group processing. For example, using a hash table as the medium of choice:

data have ;
  input id drug $ (start_dt end_dt) (:mmddyy8.) ;
  cards ;
1 A 10/14/19 11/14/19
1 B 11/06/19 12/06/19
1 A 12/09/19 01/09/20
2 A 10/01/19 11/01/19
2 B 10/25/19 11/25/19
2 B 12/01/19 12/31/19
3 A 10/06/19 11/06/19
3 B 10/01/19 11/01/19
4 A 11/01/19 11/30/19
4 B 11/03/19 12/03/19
4 B 12/15/19 01/15/20
5 B 10/05/19 11/05/19
5 A 10/01/19 11/01/19
;

data _null_ ;
  dcl hash h (ordered:"a") ;
  h.definekey ("id", "date") ;
  h.definedata ("id", "date", "olap", "olap_ct") ;
  h.definedone () ;
  do until (z) ;
    set have end = z ;
    do date = start_dt to end_dt ;
      if h.find() ne 0 then olap_ct = 1 ;
      else olap_ct + 1 ;
      olap = olap_ct > 1 ;
      h.replace() ;
    end ;
  end ;
  h.output (dataset:"hash") ;
run ;

data want (keep = id start_dt end_dt olap_ct) ;
  do until (last.olap) ;
    set hash ;
    by id olap notsorted ;
    if first.olap then start_dt = date ;
  end ;
  if olap ;
  end_dt = date ;
  format start_dt end_dt yymmdd10. ;
run ;

One caveat of the above is that it can create quite a sizable hash table, particularly if you process claims, which usually are big files. To avoid overtaxing memory in such a case, we can make use of the existing sorted order by ID, so that the largest chunk of data loaded into the hash is dictated by the largest BY group. At the same time, rather than concatenating the hashes from all BY groups into file HASH for subsequent BY processing, we can replace it with artificial BY processing on the fly using control-break logic while enumerating the hash table with an iterator. A dummy item can be added to each partial hash to simplify the control-break code (as done below). This way, everything is done in a single step, and memory usage is kept in check. E.g.:

data want (keep = id start_dt end_dt olap_ct) ;
  if _n_ = 1 then do ;
    dcl hash h (ordered:"a") ;
    h.definekey ("id", "date") ;
    h.definedata ("id", "date", "olap", "olap_ct") ;
    h.definedone () ;
    dcl hiter hi ("h") ;
  end ;
  do until (last.id) ;
    set have ;
    by id ;
    do date = start_dt to end_dt ;
      if h.find() ne 0 then olap_ct = 1 ;
      else olap_ct + 1 ;
      olap = olap_ct > 1 ;
      h.replace() ;
    end ;
  end ;
  hi.last() ;
  date + 1 ;
  olap = 0 ;
  h.add() ;
  format start_dt end_dt yymmdd10. ;
  do _n_ = hi.first() by 0 while (_n_ = 0) ;
    if olap and ^ _olap then start_dt = date ;
    if ^ olap and _olap then do ;
      end_dt = _date ;
      olap_ct = _olap_ct ;
      output ;
    end ;
    _olap = olap ;
    _olap_ct = olap_ct ;
    _date = date ;
    _n_ = hi.next() ;
  end ;
  h.clear() ;
run ;

Note that in the output file WANT start_dt and end_dt represent the endpoints of the date intervals where the drugs overlap. OLAP_CT gives the number of drugs overlapping within the interval. So, if you had more than 2 drugs per member, and within some interval 3 or more drugs overlapped, you'd see the corresponding number. Of course, the OLAP_CT=1 case (no overlap) is filtered out.

Kind regards Paul D.

hashman · ‎01-28-2020

@Angmar: You need to break out of the loop once the diagnosis searched for has been found:

data lib.nbhr05 (drop = i) ;
  set red.red05 (where=(city='3')) ;
  array diag [25] diag_code_1-diag_code_25 ;
  do i = 1 to dim (diag) ;
    if diag[i] in: ("T67", "R55") then do ;
      output ;
      leave ;
    end ;
  end ;
run ;

However, the fact is that you don't need any looping at all:

data lib.nbhr05 ;
  set red.red05 (where=(city='3')) ;
  array diag [25] diag_code_1-diag_code_25 ;
  if "T67" in diag or "R55" in diag ;
run ;

Kind regards Paul D.
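One thing worth checking before swapping one form for the other in the reply to @Angmar above: IN: compares by prefix, while the loop-free IN against the array compares whole values. Hence the two programs agree only if the codes are stored as exactly "T67" and "R55" rather than with extra trailing characters. A minimal sketch with made-up values (the array contents and the variable names EXACT and PREFIX are assumptions for illustration only):

```sas
data _null_ ;
  * hypothetical codes: the second one carries an extra trailing digit ;
  array diag [3] $ 5 ("A10", "T670", "Z99") ;
  exact = "T67" in diag ; * whole-value comparison against every element ;
  prefix = 0 ;
  do i = 1 to dim (diag) until (prefix) ;
    prefix = diag[i] in: ("T67") ; * the colon truncates to the shorter length ;
  end ;
  put exact= prefix= ; * should print exact=0 prefix=1 ;
run ;
```

If the codes can carry such trailing characters, the loop with IN: and LEAVE is the form to keep.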

hashman · ‎01-27-2020

@mkeintz: Mark, a nice complement. Truth be told, though, the crux of this problem isn't which particular tool to use for producing the sums but rather how to knead the input data to enable the summation. I absolutely agree with what you've said of the broader functionality of SUMMARY, but all by itself it's powerless to face this problem head-on. Which, of course, can be said of any other aggregation method used in this thread. A truck may be able to carry a huge tree by its sheer weight; but it cannot do that before the tree is cut into shapes that can fill its bed properly. Kind regards Paul D.

hashman · ‎01-27-2020

@BCNAV: Yet another variation - hash-based:

data have ;
  input depart $ dest $ ocu ;
  cards ;
cyyz egll 500
cyvr cyyz 10
egll cyyz 500
;

data _null_ ;
  dcl hash h () ;
  h.definekey ("city_pair") ;
  h.definedata ("city_pair", "ocu") ;
  h.definedone () ;
  do until (z) ;
    set have (rename=ocu=_ocu) end = z ;
    city_pair = put (catx ("-", depart, dest), $9.) ;
    if dest < depart then city_pair = catx ("-", dest, depart) ;
    if h.find() ne 0 then ocu = _ocu ;
    else ocu + _ocu ;
    h.replace() ;
  end ;
  h.output (dataset:"want") ;
run ;

Kind regards Paul D.

hashman · ‎01-27-2020

@michokwu: SQL is perhaps the best (or at least simplest) option:

data city ;
  input @1 City $6. City_Code:$9. Item1-item4 ;
  cards ;
City A 123001001 10 20 30 40
City B 123001002 20 30 40 50
City C 123002001 30 40 50 60
City D 123002002 40 50 60 70
City E 123003001 50 60 70 80
City F 123003002 60 70 80 90
City G 123004001 70 80 90 100
City H 123004002 80 90 100 110
;

data region ;
  input Region $ Region_Code:$6. ;
  cards ;
North 123001
South 123002
East 123003
West 123004
;
run ;

proc sql ;
  create table want as
  select region
       , region_code
       , sum (item1) as item1
       , sum (item2) as item2
       , sum (item3) as item3
       , sum (item4) as item4
  from city, region
  where region_code = put (city_code, $6.)
  group 1, 2
  order 2
  ;
quit ;

If you are loath to list all the items as above, especially if there are way more than 4, heed what @Reeza has said. It can be done in a single DATA step, too (i.e. without listing all the items), using the hash object; but if you see the requisite code, you'll sure appreciate the simplicity of SQL:

data _null_ ;
  if _n_ = 1 then do ;
    if 0 then set region city ;
    array it item: ;
    dcl hash h (ordered:"a") ;
    h.definekey ("region_code") ;
    h.definedata ("region", "region_code") ;
    do over it ;
      h.definedata (vname (it)) ;
    end ;
    h.definedone() ;
    do until (lr) ;
      set region end = lr ;
      h.add() ;
    end ;
  end ;
  set city (rename=(item1-item4=_it1-_it4)) end = lc ;
  array _it _it: ;
  region_code = put (city_code, $6.) ;
  if h.find() ne 0 then call missing (of item:) ;
  do over it ;
    it + _it ;
  end ;
  h.replace() ;
  if lc then h.output (dataset:"want") ;
run ;

OTOH, instead of listing all the sum(item1) as item1, ... in the SQL query, you can auto-construct a macro variable containing the necessary text:

data _null_ ;
  length sumit $ 32767 ;
  do i = 1 to 4 ;
    sumit = catx (",", sumit, cats ("sum(item", i, ") as item", i)) ;
  end ;
  call symputx ("sumit", sumit) ;
run ;

proc sql ;
  create table want as
  select region
       , region_code
       , &sumit
  from city, region
  where region_code = put (city_code, $6.)
  group 1, 2
  order 2
  ;
quit ;

Kind regards Paul D.

hashman · ‎01-27-2020

@yabwon: Barteku, yup, that was my understanding. Fewer passes through the input is a noble goal. And I did realize that I was making an extra pass, just my angle was a bit different here, so I decided it was worth the sacrifice ;). Kind regards Paul D.

hashman · ‎01-27-2020

@yabwon: Bart, methinks it's more economical output-wise to base the splitting on the max LENGTH of the input VAR than on VLENGTH to avoid a bunch of empty trailing output VV's in case the system length >> actual max length. I.e., something like:

data have ;
  length var $ 100 ;
  do var = "abcdefghihbuiuhfnjbvjzknoiewhfkbvzncldlwhflva"
         , "abcdefghihbuiuhfnjbvjz"
         , "abc"
         ;
    output ;
  end ;
run ;

%let split = 20 ;

proc sql noprint ;
  select max (ceil (divide (length (var), &split))) into :ns from have ;
quit ;

data want ;
  set have ;
  array vv [&ns] $ &split ;
  do _n_ = 1 to &ns ;
    vv[_n_] = substrn (var, (_n_ - 1) * &split + 1) ;
  end ;
run ;

Kind regards Paul D.


Uniform Hashing of Arbitrary Input Into Key-Exclusive Segments

The Hash Object, Recursion and Documentation

Clearing Up Some Hash Object Mysteries - A Trialogue

Splitting a SAS data set based on the value of a variable

Beyond Table Look-up: The Versatile SAS Hash Object

Re: how to get total previous 5 days dose by subject date ?

Re: Assigning Multiple Variables

Re: Aggregation using hashing: counting distinct occurrences based on ...

Re: The memrc argument in the hash object definedone method

Re: The memrc argument in the hash object definedone method

Re: The memrc argument in the hash object definedone method

Re: Drug Overlap

Re: Use do loop index in new variable name

Re: Drug Overlap

Re: do loops in arrays outputting multiple observations?

Re: Summing based on 2 Pairs (Airport Data)

Re: Summing based on 2 Pairs (Airport Data)

Re: Map and Aggregate Data

Re: How to split a variable into 200 Character without chopping a word...

Re: How to split a variable into 200 Character without chopping a word...