SAS Programming

Shafian · Posted 06-24-2022 06:11 AM

Hello there,

I pratice SAS for several years, but I am not an expert at all especially when dealing with loops and double &.

I think about the following problem for quite some time, and tried a lot of keyword research online, with no success.

I want to solve this problem only in SAS, not in another software.

Please consider the following table (it is my first post sorry if the format is not good):

data have;
  input ID	Base	Reach1	Reach3	Reach6	Reach10	Reach15 :BEST12.;
cards;

1	5	1	1	1	9	10
2	6	1	1	1	8	9
3	8	1	1	2	7	8
4	4	1	1	1	6	7
5	9	1	3	4	5	6
6	2	1	2	3	4	5
7	1	1	1	2	3	4
8	5	1	1	1	2	3
9	6	1	1	1	1	2
10	10	1	1	1	1	1
11	25	1	2	3	.	.
12	2	1	1	2	.	.
13	4	1	1	1	.	.
14	6	.	.	.	.	.
;run;

The core dataset are the two first columns, ID and Base. The important column is Base.

The following columns Reachx are the ones I want to construct with SAS code.

I want, considering the column Base, for each row and for a specific number (which can be another numeric variable), to find in the next rows the first time the number or a superior number occur, (value = that row number after the current row).

Exemples
For Reach1, how much row(s) I need to go down in the column Base to find a 1 or higher
For Reach3, how much row(s) I need to go down in the column Base to find a 3 or higher
For Reachx, how much row(s) I need to go down in the column Base to find a x or higher

Plot twist: I have 800k rows

I tried looking forward with lag in a reverse order table; there is probably something to do with a loop but it is not my level of expertise.
I tried looking for proc expand, which solved some of my similar problems, but not this one.

Can you help me ?

Thank you for spending time to read and answer neebies questions, either or not you answer this one 🙂

Regards
Shaf

Ksharp · Posted 06-25-2022 06:57 AM

This question is very interesting. I like it !

data have;
  input ID	Base	;
cards;
1	5	
2	6	
3	8	
4	4	
5	9	
6	2	
7	1	
8	5	
9	6	
10	10	
11	25	
12	2	
13	4	
14	6	
;run;

data want;
 if _n_=1 then do;
   if 0 then set have(rename=(base=_base));
   declare hash h(dataset:'have(rename=(base=_base))',ordered:'y');
   declare hiter hi('h');
   h.definekey('id');
   h.definedata('_base');
   h.definedone();
 end;
set have;
Reach1=1;	Reach3=1;	Reach6=1;	Reach10=1;	Reach15=1;
_id=id+1;
rc=hi.setcur(key:_id);
if rc ne 0 then call missing(of Reach:);
do while(rc=0);
 if not found1 then do;if _base<1 then Reach1+1; else found1=1;end;
 if not found3 then do;if _base<3 then Reach3+1; else found3=1;end;
 if not found6 then do;if _base<6 then Reach6+1; else found6=1;end;
 if not found10 then do;if _base<10 then Reach10+1; else found10=1;end;
 if not found15 then do;if _base<15 then Reach15+1; else found15=1;end;
 rc=hi.next();
end;
 if not found1 then call missing(Reach1);
 if not found3 then call missing(Reach3);
 if not found6 then call missing(Reach6);
 if not found10 then call missing(Reach10);
 if not found15 then call missing(Reach15);

drop _: rc found:;
run;

View solution in original post

ballardw · Posted 06-24-2022 12:44 PM

Is this a reduced example of a larger problem? If so, are the ID always sequential with no repeats AND how many records might be involved?

I have an idea for a possible solution but it may not extend to "large" data.

ballardw · Posted 06-24-2022 01:30 PM

This seems to work for your example data:

data have;
  input ID	Base	;
cards;

1	5	
2	6	
3	8	
4	4	
5	9	
6	2	
7	1	
8	5	
9	6	
10	10	
11	25	
12	2	
13	4	
14	6	
;run;

proc transpose data=have out=trans(drop=_name_)
   prefix=Z;
   var base;
   id id;
run;

data makeit;
   set trans;
   array z(*) z: ;
   do id=1 to dim(z);
      base=z[id];
      do reach = 1, 3, 6,10, 15;
         if id< dim(z) then do j=(id +1) to dim(z);
            if z[j] ge reach then do;
               rows= j-id;
               output;
               leave;
            end;
            else if j=dim(z)   then do;
               rows= .;
               output;
            end;
         end;
         else do;
            rows=.;
            output;
         end;
      end;
   end;
run;

proc transpose data=makeit out=want (drop=_name_)
   prefix=Reach;
   by id base;
   id reach;
   var rows;

run;

Warnings: Assumption 1: Id is sequentially numbered

Assumption 2: Id is not repeated

Transpose with large data sets may cause issues with time and memory use.

I use transpose to reshape the data into a relatively known form with all the values in one observation. Then use an array to hold all those values so then can be used and iterated.

The data step Makeit only needs you to provide the list of variable values to read on the Do Reach = ....; line of codes. If you repeat a value in the list of Reach values then the following transpose will fail. Negative values or decimals will also likely fail when the transpose attempts to use them for variables.

There arrays have some timing issues. I suggest working through the what is output when carefully. The placement of OUTPUT statements is important. If you have not seen the LEAVE instruction it is a way to quit a loop when a condition is met other than the basic loop iteration values. So it stops searching when the first value matching the condition is met.

Then use Transpose to reshape the data back using the values of the Reach variable to rename things.

Shafian · Posted 06-24-2022 02:12 PM

@ballardw wrote:
Is this a reduced example of a larger problem?

No, the explained issue is currently my main issue, at a bigger scale (number of rows) and with some custom "Reachx" variables/values (eg (x = Base + 0.5*Base) or (x = Base - 0.5*Base))

@ballardw wrote:
If so, are the ID always sequential with no repeats

Yes, the ID are sequential with no repeat

@ballardw wrote:
how many records might be involved?

800k rows

ballardw · Posted 06-27-2022 08:04 PM

@Shafian wrote:

@ballardw wrote:

Is this a reduced example of a larger problem?

No, the explained issue is currently my main issue, at a bigger scale (number of rows) and with some custom "Reachx" variables/values (eg (x = Base + 0.5*Base) or (x = Base - 0.5*Base))

@ballardw wrote:

If so, are the ID always sequential with no repeats

Yes, the ID are sequential with no repeat

@ballardw wrote:

how many records might be involved?

800k rows

I haven't transposed any data sets with 800k rows lately. My approach is going to attempt to create and use 800k variables. Which SAS will do but takes about 5 seconds just to create an empty data set with one record like that. So there will be some time involved. I might suggest testing it the approach once with the data set option OBS=10000 on the first Proc Transpose for the existing data to see how long that runs. Use that to make a guess as the duration of the the full data set. It might be in the "take a coffee break" amount of time.

CarmineVerrell · Posted 06-24-2022 01:08 PM

Not sure if this is what you are looking for ....please take a look. Created an array with do loops to complete. Storing the next highest value in a new column called xhreach1 ...

data have;
input ID Base Reach1 Reach3 Reach6 Reach10 Reach15 :BEST12.;
array txtcol(5) $7 _temporary_ ("Reach1" "Reach3" "Reach6" "Reach10" "Reach15");
array Reach(5) Reach1 Reach3 Reach6 Reach10 Reach15;
array xhReach(5) $7 xhReach1 xhReach3 xhReach6 xhReach10 xhReach15;
retain prevalue 0;
do j=1 to 5;
/* if reach(j) = . then do; found=1 ; end; */
if reach(j) ne . then do; prevalue=reach(j);end;
found=0;
do i=1 to 5;
if reach(i) ne . and prevalue < reach(i) and found ne 1 then
do;
xhReach(j)=txtcol(i);
found=1;

end;
if i=5 and found=0 then xhReach(j)="N/A";
end;
end;

cards;

1 5 1 1 1 9 10
2 6 1 1 1 8 9
3 8 1 1 2 7 8
4 4 1 1 1 6 7
5 9 1 3 4 5 6
6 2 1 2 3 4 5
7 1 1 1 2 3 4
8 5 1 1 1 2 3
9 6 1 1 1 1 2
10 10 1 1 1 1 1
11 25 1 2 3 . .
12 2 1 1 2 . .
13 4 1 1 1 . .
14 6 . . . . .
;run;

Ksharp · Posted 06-25-2022 06:57 AM

This question is very interesting. I like it !

data have;
  input ID	Base	;
cards;
1	5	
2	6	
3	8	
4	4	
5	9	
6	2	
7	1	
8	5	
9	6	
10	10	
11	25	
12	2	
13	4	
14	6	
;run;

data want;
 if _n_=1 then do;
   if 0 then set have(rename=(base=_base));
   declare hash h(dataset:'have(rename=(base=_base))',ordered:'y');
   declare hiter hi('h');
   h.definekey('id');
   h.definedata('_base');
   h.definedone();
 end;
set have;
Reach1=1;	Reach3=1;	Reach6=1;	Reach10=1;	Reach15=1;
_id=id+1;
rc=hi.setcur(key:_id);
if rc ne 0 then call missing(of Reach:);
do while(rc=0);
 if not found1 then do;if _base<1 then Reach1+1; else found1=1;end;
 if not found3 then do;if _base<3 then Reach3+1; else found3=1;end;
 if not found6 then do;if _base<6 then Reach6+1; else found6=1;end;
 if not found10 then do;if _base<10 then Reach10+1; else found10=1;end;
 if not found15 then do;if _base<15 then Reach15+1; else found15=1;end;
 rc=hi.next();
end;
 if not found1 then call missing(Reach1);
 if not found3 then call missing(Reach3);
 if not found6 then call missing(Reach6);
 if not found10 then call missing(Reach10);
 if not found15 then call missing(Reach15);

drop _: rc found:;
run;

mkeintz · Posted 06-26-2022 05:10 AM

I like the idea of a hash object. Here is a usage that iterates through the object twice: the first time simply records all the base values for each sequence number (i.e. the equivalent of _N_ is a data step). The second time steps through the sequence values in reverse (i.e. from last base value to first). This provides a means to record, for each row in the object, the row numbers of the next 3 or higher, next 6 or higher, etc. base values.

One of the advantages is that this object does not need the overhead of forcing the object to stored in any particular order. Being keyed on sequence number is enough.

data have (keep=id base)  want0 (label='To compare to results of code below');
  input ID	Base	Reach1	Reach3	Reach6	Reach10	Reach15 :BEST12.;
cards;
1	5	1	1	1	9	10
2	6	1	1	1	8	9
3	8	1	1	2	7	8
4	4	1	1	1	6	7
5	9	1	3	4	5	6
6	2	1	2	3	4	5
7	1	1	1	2	3	4
8	5	1	1	1	2	3
9	6	1	1	1	1	2
10	10	1	1	1	1	1
11	25	1	2	3	.	.
12	2	1	1	2	.	.
13	4	1	1	1	.	.
14	6	.	.	.	.	.
run;

data want (drop=_:);
  if _n_=1 then do;  
    if 0 then set have;
    call missing(_seq,_next1,_next3,_next6,_next10,_next15);
    declare hash bases();
      bases.definekey('_seq');
      bases.definedata('_seq','base','_next1','_next3','_next6','_next10','_next15');
      bases.definedone();
    do _seq=1 by 1 until (end_of_bases);    /* Populate the hash object */
      set have (keep=base) end=end_of_bases;
      bases.add();
    end;
    _nseq=_seq;

    do _seq=_nseq to 1 by -1;  /* For each SEQ, determine row numbers of next 3 or higher, etc. */
      bases.find();            /*Get current BASE value */

      /* Now rewrite the object dataitem with with NEXT row numbers for each base value*/
      bases.replace(key:_seq,data:_seq,data:base,data:_n1,data:_n3,data:_n6,data:_n10,data:_n15);
      array _n {15};
      do _b=1,3,6,10,15;  /* Now update next row numbers base on current BASE, if needed */
        if base>=_b then _n{_b}=_seq;
      end;
    end;
  end;

  set have;
  bases.find(key:_n_);  /*Retreive the _NEXT values */
  if _next1^=. then reach1=_next1-_n_;
  if _next3^=. then reach3=_next3-_n_;
  if _next6^=. then reach6=_next6-_n_;
  if _next10^=. then reach10=_next10-_n_;
  if _next15^=. then reach15=_next15-_n_;
run;

proc compare base=want0 compare=want notes noprint;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 06-26-2022 06:17 AM

While thinking a little more about the use of hash objects for this task, it seems that setting up a set of forward links for each BASE level of interest would work and be about as efficient as possible. Each link object would have entries only for qualifying observations (i.e. not for every observation)., keyed on the observation number and pointing to the next qualifying observation. This is a small footprint.

And, though use of the lag function in keying the hash objects, they can be completely built in a single pass of the data. No need to do a reverse process, as in my other suggestion. And no need to do hash iterator loops for each incoming observation to determing the REACH values:

data have (keep=id base)  want0 (label='To compare to results of code below');
  input ID	Base	Reach1	Reach3	Reach6	Reach10	Reach15 :BEST12.;
cards;
1	5	1	1	1	9	10
2	6	1	1	1	8	9
3	8	1	1	2	7	8
4	4	1	1	1	6	7
5	9	1	3	4	5	6
6	2	1	2	3	4	5
7	1	1	1	2	3	4
8	5	1	1	1	2	3
9	6	1	1	1	1	2
10	10	1	1	1	1	1
11	25	1	2	3	.	.
12	2	1	1	2	.	.
13	4	1	1	1	.	.
14	6	.	.	.	.	.
run;

data want (drop=_:);
  if _n_=1 then do;
    retain _curr _next1 _next3 _next6 _next10 _next15 .;
    /* Define a forward Link table for each BASE value of interest */
    declare hash L1();  L1.definekey('_curr');  L1.definedata('_curr','_next1');   L1.definedone();
    declare hash L3();  L3.definekey('_curr');  L3.definedata('_curr','_next3');   L3.definedone();
    declare hash L6();  L6.definekey('_curr');  L6.definedata('_curr','_next6');   L6.definedone();
    declare hash L10(); L10.definekey('_curr'); L10.definedata('_curr','_next10'); L10.definedone();
    declare hash L15(); L15.definekey('_curr'); L15.definedata('_curr','_next15'); L15.definedone();

    do _seq=1 by 1 until (end_of_have);  /* Populate link tables, using lag of _SEQ as key to current _SEQ */
      set have end=end_of_have;
      if base>=1 then do;
        L1.add(key:lag(_seq),data:lag(_seq),data:_seq);
        if base>=3 then do;
          L3.add(key:lag(_seq),data:lag(_seq),data:_seq);
          if base>=6 then do;
            L6.add(key:lag(_seq),data:lag(_seq),data:_seq);
            if base>=10 then do;
              L10.add(key:lag(_seq),data:lag(_seq),data:_seq);
              if base>=15 then L15.add(key:lag(_seq),data:lag(_seq),data:_seq);
            end;
          end;
        end;
      end;
    end;
    /* Now initialize _NEXT values for beginning of the data */
    L1.find(key:.);
    L3.find(key:.);
    L6.find(key:.);
    L10.find(key:.);
    L15.find(key:.);
  end;
  /* Now read the data and use the LINK tables.  If _rc^=0 then most previous valid _NEXT values are retained */
  set have;
  _rc=L1.find(key:_n_);   if _next1>_n_  then reach1 =_next1-_n_;
  _rc=L3.find(key:_n_);   if _next3>_n_  then reach3 =_next3-_n_;
  _rc=L6.find(key:_n_);   if _next6>_n_  then reach6 =_next6-_n_;
  _rc=L10.find(key:_n_);  if _next10>_n_ then reach10=_next10-_n_;
  _rc=L15.find(key:_n_);  if _next15>_n_ then reach15=_next15-_n_;
run;  /* 49 * ; */

proc compare base=want0 compare=want notes noprint;
run;

If you want to see the contents of the link tables, then you can (using table L6 as an example), change

    declare hash L6();

statement to

declare hash L6(ordered:'A');

which will organize the data in a user-friendly order. And also add an output method adjacent to the corresponding initial find method, producing this code

    L6.find(key:.);    L6.output(dataset:'L6');

which will output the hash contents to a dataset for viewing.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Shafian · Posted 06-29-2022 04:35 AM

Thank you for all your responses !

I tried, according to my knowledge (meaning if for me something doesnt work it is maybe my fault), all the solutions
according to my data.

A few key points :
-proc transpose on a 800k table doesnt work ; the proc seems ok, but when i try to open the table, infinite lag

-I specified (perhaps not enough) in my original message that the x of the Reachx can be another column (eg Reachx = Base*1.05, meaning there is Base, Reachx AND the colomn with the wanted number of row) ; in that way I cannot make the 2 proposed solutions from @mkeintz (I dont know how to tag) work with a variable column as Reachx (which change each row, see eg above)

-The solution from @Ksharp work with a variable column, but it is unpractipable for a large table like mine ; for 30k rows it took 30sec, for 50k 1m30sec. To keep the continuity of the table it is probably possible tu cut the table into pieces and to make an overlap reconstruction happend after this solution for each cut, but it is not my priority now. But it is good to know !

I am very happy that at least one solution kinda work for me.

I hope this will help others for future similar questions.

I will accept the @Ksharp answer soon.

Thank you 🙂

SAS Programming

Finding in the next rows the first time a number (or higher) occurs

Re: Finding in the next rows the first time a number (or higher) occurs

Re: Finding in the next rows the first time a number (or higher) occurs

Re: Finding in the next rows the first time a number (or higher) occurs

Re: Finding in the next rows the first time a number (or higher) occurs

Re: Finding in the next rows the first time a number (or higher) occurs

Re: Finding in the next rows the first time a number (or higher) occurs

Re: Finding in the next rows the first time a number (or higher) occurs

Re: Finding in the next rows the first time a number (or higher) occurs

Re: Finding in the next rows the first time a number (or higher) occurs

Re: Finding in the next rows the first time a number (or higher) occurs

Follow Us

What is...

SAS Programming

Our biggest data and AI event of the year.

SAS Training: Just a Click Away

Follow Us

What is...