@lydiawawa:
What you are asking to do is easy: just assign an appropriate expression to the KEY argument tag when the CHECK method is called, for example:
data have ;
input @1 date $23. x y :$1. z :$3. ;
cards ;
2018-04-03 03:44:18.728 1 A A01
2018-04-03 07:40:02.221 2 B B02
2018-05-03 09:20:20.135 3 C C03
2018-06-03 14:50:11.752 4 D D04
2018-07-03 02:42:17.005 5 E E05
2018-08-05 01:22:20.264 6 F F06
2018-01-06 04:45:49.402 7 G G07
2018-11-06 04:09:50.710 8 H H08
2018-07-07 04:12:31.623 9 I I09
2018-12-11 04:11:01.528 10 J J10
;
run ;
data dates ;
input m_yy $4. ;
cards ;
4-03
1-06
;
run ;
data want (drop = m_yy) ;
if _n_ = 1 then do ;
if 0 then set dates ; * never executes - only makes M_YY known to the compiler ;
dcl hash h (dataset: "dates") ;
h.definekey ("m_yy") ;
h.definedone () ;
end ;
set have ;
if h.check (key: put (substr (date, 7), $4.)) = 0 ;
run ;
However, from what you've said I suspect that's not the real snag. If you run into a problem just applying a WHERE clause to the input, hash code akin to the above isn't going to help you much. Most likely, your input has a "tail" of satellite variables, much more numerous than the X Y Z I've included for the sake of a demo; and though WHERE moves only the records that qualify from the buffer to the PDV, it doesn't relieve the I/O burden created by the satellites enough. With hash code like the above, things are even worse, because every record gets moved from the buffer into the PDV before the unwanted ones get discarded by the subsetting IF.
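(For reference, the plain WHERE route mentioned above is just something like the minimal sketch below. The two month-day values from DATES are hard-coded purely for illustration; with a real list you would build it into a macro variable first or stick with the hash lookup.)
data want ;
set have (where = (substr (date, 7, 4) in ("4-03", "1-06"))) ;
run ;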
Hence, the strategy needs to be adjusted depending on the nature of your input data. Just to outline two extreme scenarios, suppose that the records you end up selecting constitute:
(1) but a fraction of the whole input, or
(2) the lion's share of the input.
In both cases, you have very few records to either (1) filter in or (2) filter out. And in both cases, it makes sense to first identify those records by observation number with as little computer resource pain as possible, and then (1) employ some tactic to get those you need or (2) mark those you don't want as deleted. Either way, at this stage it makes sense to drop all the satellite variables, read in only the key, and apply the subsetting criteria to it, so that you end up with a list of record IDs (i.e. the observation numbers) you either (1) want or (2) don't want.
Let's first look at #1:
data want (drop = m_yy) ;
* hash to lookup m_yy ;
dcl hash h (dataset: "dates") ;
h.definekey ("m_yy") ;
h.definedone () ;
dcl hash r () ;
* hash to store filtered-in RIDs ;
r.definekey ("rid") ;
r.definedone () ;
dcl hiter ir ("r") ;
* find needed RIDs ;
do rid = 1 by 1 until (lr) ;
* KEEP is critical in SET below ;
set have (keep = date) end = lr ;
if h.check (key: put (substr (date, 7), $4.)) = 0 then r.add() ;
end ;
* select only records with RIDs in hash R from HAVE ;
do while (ir.next() = 0) ;
set have point = rid ;
output ;
end ;
stop ;
* never executes - only makes M_YY known to the compiler ;
set dates ;
run ;
In the extreme scenario #2, you don't want any record whose M_YY is in the data set DATES:
data discard (keep = rid) ;
if _n_ = 1 then do ;
if 0 then set dates ;
dcl hash h (dataset: "dates") ;
h.definekey ("m_yy") ;
h.definedone () ;
end ;
* KEEP is critical in SET below ;
set have (keep = date) ;
* find UNneeded RIDs ;
if h.check (key: put (substr (date, 7), $4.)) = 0 ;
rid = _n_ ;
run ;
data have ;
set discard ;
modify have point = rid ;
remove ;
run ;
In this case, you just make a list of the unwanted RIDs in the first step (which reads nothing but the key) and, in the second step, use that list to mark the respective records in the data set HAVE itself as "deleted". This way, you (a) still have never read anything from HAVE except the key and (b) have never written out a huge data set with all the satellites and only a few records discarded. You've merely marked the unwanted records as "deleted" in HAVE. So, in your program downstream you will just read the data set HAVE, and all the records marked for deletion will be automatically ignored.
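Just to illustrate the downstream point with a quick sketch: any later step that reads HAVE skips the removed records automatically, and PROC CONTENTS reports how many are flagged under Deleted Observations.
* downstream step: the removed records are skipped automatically ;
data downstream ;
set have ;
run ;
* CONTENTS shows the count of records flagged as deleted ;
proc contents data = have ;
run ;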
Of course, there are other scenarios in between these two extremes, but you should be getting the drift. When you deal with voluminous data, things are not always clear-cut and one has to be inventive; it's an art as much as a science. At times, to engineer a successful ETL, one needs to first do a distribution analysis on the keys only and then write a dynamic program smart enough to choose a subsetting tactic based on the distribution.
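Just to give a flavor of what such a key-only pre-pass might look like, here's a rough sketch only - the 10 percent threshold, the macro variable HIT_SHARE, and the macro SUBSET are all made up for illustration:
* pre-pass: read only the key and count the qualifying records ;
data _null_ ;
if _n_ = 1 then do ;
if 0 then set dates ;
dcl hash h (dataset: "dates") ;
h.definekey ("m_yy") ;
h.definedone () ;
end ;
set have (keep = date) end = lr ;
if h.check (key: put (substr (date, 7), $4.)) = 0 then hits + 1 ;
if lr then call symputx ("hit_share", hits / _n_) ;
run ;
* choose a tactic based on the share of qualifying records ;
%macro subset ;
%if %sysevalf(&hit_share < 0.1) %then %do ;
* few hits: retrieve them by POINT= as in scenario #1 ... ;
%end ;
%else %do ;
* mostly hits: mark the rest as deleted as in scenario #2 ... ;
%end ;
%mend ;
%subset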
Kind regards
Paul D.