mkeintz Tracker

Re: Max per loan in last 7 days

mkeintz — Wed, 10 Sep 2025 03:19:02 GMT

@Ronein wrote:
Why array have 6 arguments?

The array does NOT have six elements, it has seven elements, where the lower bound is zero (not one) and the upper bound is six. When you divide the date value by 7 and keep the remainder, you will end up with 0, 1, ..., 5, or 6. So each day of the week can trivially be placed in exactly one element of the array.
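A quick way to see this is to print the remainder for seven consecutive dates. The sketch below uses an arbitrary week in July 2025 (chosen for illustration only); each date should land in a different slot from 0 to 6.

```sas
data _null_;
  /* Seven consecutive dates: each one maps to a distinct remainder 0..6 */
  do date='01jul2025'd to '07jul2025'd;
    slot=mod(date,7);
    put date= date9. slot=;
  end;
run;
```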

Re: Max per loan in last 7 days

mkeintz — Mon, 08 Sep 2025 11:08:19 GMT

@Ronein wrote:
Why array have 30000 arguments? How dud you know to choose this number ?

If you are using an array from 0 to 30000, it covers 01jan1960 through 19feb2042. You can customize the date range as shown below (say, for 01jan2009 through 31dec2020):

%let begdate=01jan2009 ;
%let enddate=31dec2020 ;

data;
  * other sas code ;
  array history {%sysevalf("&begdate"d):%sysevalf("&enddate"d)} _temporary_ ;
  * other sas code ;
run;
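To see which calendar dates a pair of numeric array bounds corresponds to, or which numeric bounds the macro expressions above resolve to, something like the following could be run in open code. It is only a sketch: it uses %SYSFUNC with PUTN, plus the same %SYSEVALF expressions as the ARRAY statement.

```sas
/* Calendar dates behind the numeric bounds 0 and 30000 */
%put %sysfunc(putn(0,date9.)) through %sysfunc(putn(30000,date9.));

/* Numeric bounds behind the customized date range */
%let begdate=01jan2009 ;
%let enddate=31dec2020 ;
%put lower=%sysevalf("&begdate"d) upper=%sysevalf("&enddate"d);
```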

Re: Max per loan in last 7 days

mkeintz — Sun, 07 Sep 2025 20:49:58 GMT

Here's a data step that effectively maintains a moving window of offer amounts for the prior 7 days. With each obs, it does the following:

First weed out stale (over 7 days prior) data.
Then just take the maximum of that small array.
Then update the arrays with the new obs.

data have;
  format date ddmmyy10.;
  input custid date :date9. loansAmnt offerAmnt;
cards;
111 01JUL2025 .     50000
111 03jul2025 .     22000
111 08jul2025 .     19000
111 09jul2025 5000  .
222 01jul2025 .     40000
222 03jul2025 .     28000
222 04jul2025 13000 .
222 08jul2025 .     27000
223 09jul2025 35000 .
run;
data want (drop=_:);
  set have;
  by custid;

  array tempdat{0:6} _temporary_; /*Observed dates within prior 7 days*/
  array tempval{0:6} _temporary_; /*Observed offers within prior 7 days*/  

  /* Weed stale data first */
  if first.custid=1 then call missing(of tempval{*},of tempdat{*});
  else do while (.< min(of tempdat{*}) < date-7);    
    _d=mod(min(of tempdat{*}),7);
    call missing(tempval{_d},tempdat{_d});
  end;

  /* Get the maximum */
  if n(of tempdat{*})>0 then max_prior_7days=max(of tempval{*});

  /*Update the arrays with current obs */
  _d=mod(date,7);   
  tempdat{_d}=date;
  tempval{_d}=offeramnt;
run;

Ignore the previously submitted code below, which didn't weed out stale data before taking the maximum:

~~data want (drop=_:);~~

~~set have; by custid;~~

~~array tempdat{0:6} _temporary_; /*Observed dates within prior 7 days*/~~

~~array tempval{0:6} _temporary_; /*Observed offers within prior 7 days*/~~

~~/* Get the maximum, when there is a 7-day window available */~~

~~if first.custid=0 and dif(date)<=7 then max_prior_7days=max(of tempval{*});~~

~~else call missing(of tempval{*},of tempdat{*});~~

~~/*Update the arrays with current obs */~~

~~_d=mod(date,7); tempdat{_d}=date;~~

~~tempval{_d}=offeramnt;~~

~~/*Weed out stale data */~~

~~do while (min(of tempdat{*})<date-7);~~

~~_d=mod(min(of tempdat{*}),7);~~

~~call missing(tempval{_d},tempdat{_d});~~

~~end;~~

~~run;~~

Two points:

The temporary arrays are indexed from 0 to 6, effectively representing a day-of-week index. So the data in the arrays are not ordered strictly chronologically, but order doesn't matter for getting the maximum. The task is to be sure to eliminate stale values.
The DIF(x) function is x-lag(x).
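As a small check of the DIF identity, the sketch below compares DIF(age) with age minus LAG(age) on a few rows of sashelp.class (chosen only as a convenient sample). Both are missing on the first observation and agree afterwards. Note that LAG and DIF are called unconditionally here; inside an IF branch they would only see values from the times that branch was executed.

```sas
data _null_;
  set sashelp.class (obs=5 keep=name age);
  d1=dif(age);          /* built-in difference     */
  d2=age-lag(age);      /* same thing, spelled out */
  put name= age= d1= d2=;
run;
```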

Re: Proc SQL

mkeintz — Fri, 05 Sep 2025 01:16:18 GMT

An interviewer insisting on PROC SQL code for the 3rd highest value is not attempting to assess your SAS expertise.

Re: PROC SQL match missing about 3% of cases even though they are in both files

mkeintz — Thu, 28 Aug 2025 00:52:21 GMT

@Wolverine

Glad you identified the problem. Mark your own explanation as the solution, so that this topic no longer appears as unsolved.

Re: Extracting all records that share variable 2 values, where at least one variable 1 value is = XX

mkeintz — Tue, 22 Jul 2025 13:56:26 GMT

If the dataset is not particularly big, then I think your best option is the hash object technique suggested by @Patrick. The code is simple.

That solution requires reading the data in two unsynchronized data streams, which could create a performance hit in the case of large datasets, due to disk activity.

Now if the data were sorted by VAR2, you could do:

data want;
  merge have (where=(var1='XX')  in=wanted) 
        have;
  by var2;
  if wanted;
run;

which synchronizes the data streams and reduces disk activity.

Your data are not sorted by VAR2, although they are grouped by VAR2. So you can generate data stream synchronization with code such as this:

data want (drop=_:);
  set have (where=(var1='XX') rename=(var2=_var2));

  do until (last.var2=1 and var2=_var2);
    set have;
    by var2 notsorted;
    if var2=_var2 then output; 
  end;
run;

The above assumes no more than one var1='XX' for any given VAR2. If you can have two or more var1='XX' cases, then:

data want (drop=_:);
  set have (where=(var1='XX') rename=(var2=_var2));
  by _var2 notsorted;

  if last._var2 then do until (last.var2=1 and var2=_var2);
    set have;
    by var2 notsorted;
    if var2=_var2 then output; 
  end;
run;

And yes, the degree of synchronization achieved would depend on the distribution of the "XX" cases.
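To try the two NOTSORTED steps above, a small dataset that is grouped but not sorted by VAR2 is enough. The values below are made up for illustration. With them, each of those steps should keep the two g2 rows and the three g3 rows, and drop the g1 row.

```sas
data have;
  /* Grouped by var2 (g2, g1, g3) but not in sorted order */
  input var1 $ var2 $;
datalines;
AA g2
XX g2
BB g1
CC g3
XX g3
DD g3
;
```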

Re: Identify patterns across observations

mkeintz — Sat, 19 Jul 2025 22:39:00 GMT

First, I assume that flag1 is based on the type variable, and flag2 is based on the type2 variable, correct?
Next: you have two observations for ID B with type='99'. They are separated by two observations with type='11' and type='1'.

So why have you assigned flag1=0 to those 99's, given they are NOT separated by a 10 or a 4, which I understand is the only reason to set flag1 to 0?

If I understand you correctly, and if my conjecture about ID=B is correct, then you can use the following code:

data want (drop=_: nxt:);
  do _n=1 by 1 until (last.id);
    set   have;
    by id;
    array _chk{30};
    if type='99' then do;
      if _last99^=. then do;
        _chk{_last99}=1; 
        _chk{_n}=1;
      end;
      _last99=_n;
    end;
    if type in ('10','4') then _last99=.;
    if type2=0 then _last_t2_eq_0_date=date;
  end;

  /* With the _chk array and _last_t2_eq_0_date in hand, reread this ID and set flags*/
  do _p=1 to _n;
    merge have
          have (firstobs=2 keep=id type2 rename=(id=nxt1_id type2=nxt1_t2))
          have (firstobs=3 keep=id type2 rename=(id=nxt2_id type2=nxt2_t2));
    if type='99' then flag1=sum(0,_chk{_p});
    else flag1=.;

    if id=nxt2_id and type2=2 and nxt1_t2=1 and nxt2_t2=0 then flag2=1;
    else flag2=.;
    if flag2=1 then date2=_last_t2_eq_0_date;
    else date2=.;
    output;
  end;
  format date2 date9. ;
run;

It reads each ID twice. The first time, it checks for type='99' observations that are not separated by a '4' or '10', and sets dummies in the _CHK array accordingly. It also keeps track of the last date for type2=0.

In the second pass of data, the flags are set, using the array for flag1. For flag2 the second pass uses the firstobs= options in a MERGE statement to look ahead for a pattern of type2=2 for current, followed by a type2=1 and type2=0. Note the MERGE must NOT use a BY ID statement. If it did, then the alignment set up by the FIRSTOBS parameters would be lost at the beginning of the second ID.
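The look-ahead idea can be seen in isolation with a self-merge that has no BY statement. This sketch uses sashelp.class and a made-up variable name, next_age; on the last observation the look-ahead stream is exhausted, so next_age is missing.

```sas
data lookahead;
  merge sashelp.class (keep=name age)
        sashelp.class (firstobs=2 keep=age rename=(age=next_age));
  /* No BY statement: row i of the first stream is paired with row i+1 of the second */
run;
```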

Re: Sorting data within the cell

mkeintz — Sat, 12 Jul 2025 02:42:01 GMT

You can transfer the components of the cell into a hash object with the "ordered" attribute. Then iterate through them and concatenate each of them:

data have;
  input col1 col2 :$50. ;
cards;
1 1,5,4,3,2,6,3.5
2 21,3,2,1,5,15,17,3
3 1,2,3
4 4,3,2
5 1,2,junk,3
;

data want (drop=i _:);
  set have;
  if _n_=1 then do;
    declare hash h (ordered:'A',multidata:'Y');
      h.definekey('_x');
      h.definedata('_x');
      h.definedone();
    declare hiter hi ('h');
  end;

  do i=1 to countw(col2,',');
    _x=input(scan(col2,i,','),best12.);
    if _x^=. then h.add();
  end;
  length new_col $50;
  do while (hi.next()=0);
    new_col=catx(',',new_col,_x);
  end;
  h.clear();
run;

Re: Search var with best match name

mkeintz — Thu, 10 Jul 2025 23:37:33 GMT

What if you want to find the closest match to X145, from a list of variables named X14,X15,X125,X147,X345,X545?

As @PaigeMiller has said, would the use of spelling distance have any utility at all in this case?

Re: convert numeric to char

mkeintz — Mon, 07 Jul 2025 14:42:52 GMT

If you change

z_char=put(z,best.);

to

z_char=put(z,8.2);

then z_char will display what you want. Assuming this statement is the first reference to z_char, it will be an 8-byte character variable. It will also be right justified, and with two decimal places.

You may not want it to be right justified, in which case, you can use

z_char=left(put(z,8.2));

But remember that if you left justify the character value, then sorting by z_char may not generate the same order as sorting by z (i.e. '11.18 ' would precede '6.18 ').

--- Lexicographic ordering.
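A minimal sketch of that effect, using the two values mentioned above (the dataset and variable names are made up): sorting by the left-justified string puts 11.18 first, while sorting by the right-justified string keeps numeric order for values of this size.

```sas
data demo;
  input z;
  z_right=put(z,8.2);        /* right justified: '    6.18', '   11.18' */
  z_left=left(put(z,8.2));   /* left justified:  '6.18', '11.18'        */
datalines;
6.18
11.18
;

proc sort data=demo out=by_left;   /* character comparison: '1' sorts before '6' */
  by z_left;
run;

proc sort data=demo out=by_right;  /* leading blanks sort before digits */
  by z_right;
run;
```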

Re: set data sets with customers in pop table only

mkeintz — Tue, 01 Jul 2025 16:08:57 GMT

If each dataset is sorted by ID, then

data want (drop=_:);
  set t1 (in=in1) t2 t3;
  by id;
  retain _found_in_t1;
  if first.id then _found_in_t1=in1;
  if _found_in_t1;
run;

Re: How to Randomly Pick a Value from Other Table

mkeintz — Tue, 01 Jul 2025 00:49:02 GMT

You have a small dataset of phone numbers by state, and a "large" dataset of accounts. I understand that you are fine with randomly assigning a given phone number to multiple qualifying accounts. Here's a program that avoids the need to sort the large dataset just to facilitate a merge:

Using @Ksharp's sample data:

data phone_arrays (keep=state _nphones _col:);
  do _nphones=1 by 1 until (last.state);
    set tablea  (where=(not missing(phone)));
    by state;
    array _col {50};
    _col{_nphones}=phone;
  end;
run;

data want (drop=_:);
  set tableb;
  if _n_=1 then do;
    if 0 then set phone_arrays ;
    declare hash h (dataset:'phone_arrays');
      h.definekey('state'); 
      h.definedata(all:'Y');
      h.definedone();
  end;
  array col {*} _col: ;
  
  if h.find()=0 then phnum=col{ceil(_nphones*ranuni(1508915))};
run;

The _COL array is given an arbitrary size -- large enough to account for the largest group of available phone numbers.

This assumes the TABLEA dataset is already sorted by state.

Re: How to sort and restrict a big dataset

mkeintz — Fri, 27 Jun 2025 00:14:07 GMT

The PROC SUMMARY offers a neat compact single-pass solution. It will certainly use a lot less disk input/output resources than the PROC SORT solution, and will probably be a lot faster - assuming there is no memory constraint.

BUT ... does your data have the possibility of tied maximum time_flag values for a given ID/VAR1/VAR2?

If not, then ignore the rest of this comment.

But if it does, then the PROC SUMMARY might not give the same result as the PROC SORT ... if LAST.VAR2 solution. It will choose different records (with possibly different VAR3/VAR4 values) among the tied records.

This is because the default behavior of PROC SORT is to preserve the original order (from the unsorted dataset) of tied records. So that solution would always choose the latest of the tied records.

A quick test of PROC SUMMARY with ties suggests it would always choose the first of such ties. At least it did so in the test below:

data ties;
  set sashelp.class (keep=name sex age weight);
  order='First'; output;
  order='Last' ; output;
run;
proc summary nway data=ties;
  class sex age;
  output out=summ_want (drop=_:) idgroup (max(weight) out (name weight order)=);
run;
proc sort data=ties;
  by sex age weight;
run;
data sort_want;
  set ties;
  by sex age;
  if last.age;
run;

Dataset summ_want has order='First' in every output record, but the PROC SORT approach always has ORDER='Last'.

Re: Need to create JOIN variable for OVERLAPPING AEs

mkeintz — Wed, 25 Jun 2025 14:48:34 GMT

If I understand your request correctly, then:

data test;
    infile datalines truncover;
    input usubjid $3. aestdtc :$10. @17 aendt date7. ae & :$200.;
    format   aendt date9.;
datalines;
101 2024-12-18  28JAN25 Appetite lost
101 2024-12-19  26DEC24 Constipation
101 2024-12-30  10FEB25 Thrombopenia
101 2025-01-06  07FEB25 Neutropenia
101 2025-01-27  07FEB25 Neutropenia
101 2025-01-13  23MAR25 Anemia
101 2025-02-10  23MAR25 Anemia
101 2025-01-25  10FEB25 Dizziness
101 2025-01-25  10FEB25 Dyspnea
101 2025-01-25  .       Fatigue
101 2025-01-27  31JAN25 Nausea
101 2025-01-27  31JAN25 WBC decreased
101 2025-02-03  15FEB25 Rhinitis
101 2025-03-01  .       Epigastralgia
101 2025-03-01  03MAR25 Nausea
101 2025-03-01  03MAR25 Vomiting
101 2025-03-24  .       Anemia
101 2025-03-24  .       Thrombopenia
run;

DATA WANT;
  SET test;
  by usubjid ae notsorted;
  astdt=input(aestdtc,yymmdd10.);
  format astdt date9.;
  if astdt>lag(aendt) or first.ae=1 then join+1;
  if first.usubjid=1 then join=1;
run;

Do you have instances of a given AE qualifying to be assigned the same JOIN value, but occurring in non-consecutive observations? If so, this code would need to be modified.

Note the BY statement allows detection of whenever a new AE description occurs (first.ae=1).

Also please provide your sample data in a working data step. For example, your informat of date9. for aendt should have been date7.

Re: Merge/join using a date from one dataset within a date range in another

mkeintz — Mon, 23 Jun 2025 19:23:52 GMT

This is a good use case for applying conditional SET statements in a data step.

Assuming:

SET1 is sorted by ID/d_start/b/c/d
SET1 has no instances of overlapping d_start-d_end date ranges
SET2 is sorted by ID/eventstart

then

data set1;
    input id $ d_start :date9. d_end :date9.
          physcat $ a b c d e;
    format d_start d_end date9.;
    datalines;
s001 01JAN2020 31DEC2020 A 1 2 3 1 0
s001 01JAN2020 31DEC2020 A 1 3 3 1 0
s001 01JAN2021 31DEC2021 B 1 3 3 2 1
s002 01JUN2019 31MAY2020 A 2 1 2 1 1
s002 01JUN2020 31MAY2021 C 2 2 2 2 0
s003 01JAN2020 31DEC2022 B 1 3 4 1 0
run;


data set2;
input id $ eventstart :date9.;
format eventstart date9.;
datalines;
s001 15JUN2020
s001 20JUL2021
s002 15AUG2019
s002 01JUL2020
s003 01JAN2021
s003 01JAN2023
run;

data need /view=need  /*Keep lowest B C D for each ID/D_START*/;
  set set1;
  by id d_start b c d;
  if first.d_start;  /* given the sort order, the first row per ID/D_START has the lowest B C D */
run;

data want (drop=_:);
  set need (keep=id d_start in=in1 rename=(d_start=_ref_date))
      set2 (                in=in2 rename=(eventstart=_ref_date));
  by id _ref_date;

  retain _left_sentinel .;
  if in1 then set need ;
  retain _right_sentinel ' ';

  if in2 ;
  set set2;
  if first.id or eventstart>d_end then call missing(of _left_sentinel--_right_sentinel);
run;

This will also accommodate multiple events within a date range.

Re: Unexpected Results from PROC SQL Left Join

mkeintz — Sat, 07 Jun 2025 03:13:24 GMT

Not only is a missing value still a value that can be used as a key, as @Kurt_Bremser says, but there are many other possible missing values. There are the 26 values of .A,.B, .... .Z, and also ._ (dot underscore). These can be used for special purposes, if you want. I don't think users would want SAS to automatically assume that such values should be ignored by default, as if they "have no value" in database tasks, even if they are "ignored" in statistical analysis.
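As a small illustration (variable names are arbitrary), the special missing values are all counted as missing by functions such as NMISS, yet they remain distinct values with a defined sort order:

```sas
data _null_;
  a=.; b=.A; c=._; d=.Z;
  n_missing=nmiss(a,b,c,d);   /* special missing values are counted as missing */
  put n_missing=;
  /* Yet they are distinct values with a defined sort order */
  if c < a and a < b and b < d then put 'Order: ._ < . < .A < .Z';
run;
```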

Re: Unexpected Results from PROC SQL Left Join

mkeintz — Fri, 06 Jun 2025 20:15:57 GMT

Your left join is generating a cartesian cross of instances of FORMID_FUP=. in both datasets if you have multiple instances of missing value in the LEFT dataset. MERGE ... BY, on the other hand, does not do this.

And since you suggest, in the case of NON-missing FORMID_FUP, that using MERGE ...; BY FORMID_FUP; yields what you want (and what you get) from the PROC SQL ... LEFT JOIN, then it must be that the LEFT dataset has exactly one record per non-missing FORMID_FUP value.

My question is why do you want cases with missing FORMID_FUP? Why not exclude those cases from the join? ... As in (see the "where=" dataset name parameters below):

proc sql;
		create table fup_timing as
		select a.*,
				t.Procedure_Date as t_Proc_Date label = "Followup Timing:  Date Procedure from Proc Form",
				t.Schedule as t_ScheduleCat label = "Followup Timing: Standard vs. Specialized"
		from followup (where=(formid_fup^=.))             as a
		left join followup_timing (where=(formid_fup^=.)) as t
		on a.FormID_FUP= t.FormID_FUP;
	quit;
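The difference between the two behaviors on missing keys can be reproduced with two tiny made-up tables (lhs and rhs are hypothetical names, not the OP's data). With two missing keys on each side, the SQL join should return four rows for the missing key plus one for key=5, while MERGE ... BY returns three rows in total.

```sas
data lhs;
  input key val_l;
datalines;
. 1
. 2
5 3
;

data rhs;
  input key val_r;
datalines;
. 10
. 20
5 30
;

proc sql;
  /* Missing keys compare equal in PROC SQL, so 2 x 2 rows are produced for key=. */
  create table sql_join as
  select a.key, a.val_l, b.val_r
  from lhs as a left join rhs as b
  on a.key=b.key;
quit;

data merge_join;
  /* MERGE ... BY pairs rows sequentially within a BY group instead */
  merge lhs (in=inl) rhs;
  by key;
  if inl;
run;
```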

Re: how to generate sequence number in sql

mkeintz — Fri, 30 May 2025 02:24:31 GMT

@Prashan wrote:

Consider SASHELP.CARS dataset and give the sequence number for MAKE variable.
Ex:- in MAKE variable, I want sequence number for AUDI

... stuff deleted ...

that too only with PROC SQL, not with data step, I know how to do with data step.

I think there are three statements relevant to this problem.

There is NO way to reliably reproduce within-group physical sequence numbers in PROC SQL using supported tools. By "reliable" I mean reproducible with certainty.
And even if you were to use monotonic(), there is no way to reproduce the results you would get in a DATA step (see my other note) unless the data were already sorted by the grouping variable. And the use of proc sql with monotonic() would require filtering the source dataset once for each group (i.e. 38 times in the case of MAKE from sashelp.cars). A big waste of resources.
But probably the most important advice is to resist the atavistic urge to use PROC SQL for a purpose that it is totally unsuited for.

Re: how to generate sequence number in sql

mkeintz — Thu, 29 May 2025 21:03:53 GMT

But @dxiao2017 .

The dataset example you are using is already sorted by SEX, as presented by the OP. But that may be unlike most situations (and unlike sashelp.class, which is the source of the data example and is sorted by name, not by sex/name).

Besides, there is no need to sort the data by sex merely to generate within-sex sequence numbers.

For instance:

data want (drop=_:);
  set sashelp.class;
  _nm+(sex='M');
  _nf+(sex='F');
  sequence=ifn(sex='M',_nm,_nf);
run;
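With only two SEX values, one counter per value is manageable. For a grouping variable with many levels (such as MAKE in sashelp.cars), one possible generalization is to keep the running count per group in a hash object. This is a sketch under that assumption, not part of the approach above, and it likewise needs no sorting:

```sas
data want;
  set sashelp.cars (keep=make model);
  if _n_=1 then do;
    declare hash h ();          /* one entry per MAKE, holding its running count */
      h.definekey('make');
      h.definedata('sequence');
      h.definedone();
  end;
  if h.find()^=0 then sequence=0;  /* first time this MAKE is seen */
  sequence+1;
  h.replace();                     /* store the updated count for this MAKE */
run;
```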

And if one MUST use SQL, then the undocumented (read "unsupported") MONOTONIC() function (see MONOTONIC-function-in-PROC-SQL) can be used for each sex:

proc sql;
  create table want as 
  select name, monotonic() as seq from sashelp.class where sex='M'
  union corr
  select name, monotonic() as seq from sashelp.class where sex='F'
  ;
quit;

But note that data order in the case of PROC SQL will almost certainly not be the same as in the original data set. And that the row order will actually change depending on the order of variables

Re: Guidance Needed: Merging Two SAS Datasets with Inconsistent Key Formats

mkeintz — Sat, 24 May 2025 03:51:16 GMT

This can be done in a single data step that first reads DATASET1 and stores it in a hash object (lookup table) with a lookup key created by modifying X (remove leading non-digits, shorten it to the last 5 digits if necessary, and remove leading zeroes). This is followed by reading DATASET2, where X is similarly modified and a lookup is performed to see whether it is in the hash object, from which the PRODUCT value is retrieved:

data dataset1 (label='x with 3 or 5 digits');
  infile datalines missover; 
  input product $4.  x :$20. ;
datalines;
via1 
via2 003
via3 014
via4 GA4
via5 GA015
via6 319
via7 23456
via8 10101010198765
run;

data dataset2;
  infile datalines ;
  input name $1.  x :$20. ;
datalines;
a 2
b 3
c 14
d 4
e 15
f GF319
g 23456
h 98765
run;

data want (drop=rc);
  set   dataset1 (in=in1) dataset2 (in=in2);
  where x^='';
  if _n_=1 then do;
    declare hash d1 ();
      d1.definekey('x');
      d1.definedata('product');
      d1.definedone();
  end;

  x=substr(x,anydigit(x)); /*Remove leading non-digits*/
  if length(x)>5 then x=substr(x,length(x)-4);  /*If too long, take last 5 digits*/
 
  do while (x=:'0'); /* Strip leading zeroes*/
    x=substr(x,2);
  end;
  if in1=1 then d1.add();
  if in2;
  rc=d1.find();
run;