🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
bnarang
Calcite | Level 5

Hi All,

My question is about applying a piece of logic. Suppose I have a dataset containing time-series information for some customers. I run a check on each customer's latest information, i.e. each customer's last row, and if the condition is true a value is assigned. But I also want to populate that same value into the earlier rows for the same customer. Here is an example:

This is what I have.

ID    varb    varc
1     a       db
1     b       db
1     c       db
2     a
2     a
3     a       db
3     a       db
3     b       db
3     c       db

So, for each ID (1, 2, 3) I want to check whether the last row of the ID has varb = 'c'. If it does, then varc should be set to 'c'. But my code only populates that one last row; I want the value populated for the previous rows of the same ID as well. How can it be done efficiently?

Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

Here is one way it could be done:

data have;
  infile cards truncover;
  input ID  (varb varc) ($);
  cards;
1       a          db
1       b          db
1      c           db
2      a
2      a
3      a          db
3      a          db
3      b          db
3      c          db
;

data want (drop=_:);
  do until (last.id);
    set have;
    by id;
    if last.id and varb eq 'c' then _hold='c';
  end;
  do until (last.id);
    set have;
    by id;
    if not missing(_hold) then varc=_hold;
    output;
  end;
run;


9 REPLIES

Ksharp
Super User

The fast way is a hash table, if you are familiar with it.

Another way is to pick out the observations where the last row of an ID has varb = 'c', and merge them back into the original dataset using a MERGE statement.
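That merge-back idea could be sketched roughly as follows (a hedged outline, not Ksharp's actual code; it assumes HAVE is sorted by ID, and the flag dataset name LASTROWS and variable _FLAG are made up for illustration):

```sas
/* Pass 1: keep one row per ID whose last row has varb='c' */
data lastrows(keep=id _flag);
  set have;
  by id;
  if last.id and varb='c' then do;
    _flag='c';
    output;
  end;
run;

/* Pass 2: merge the flag back; MERGE retains _flag across the BY group,
   so every row of a flagged ID gets varc='c' */
data want(drop=_flag);
  merge have lastrows;
  by id;
  if _flag='c' then varc='c';
run;
```

Because only one row per ID exists in LASTROWS, the MERGE carries its values across the whole BY group, which is what propagates the flag to the earlier rows.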

This time, though, I want to borrow an idea from ArthurT: the array-in-data-step technique, which is extremely fast.

data have;
  infile cards truncover;
  input ID  (varb varc) ($);
  cards;
1       a          db
1       b          db
1      c           db
2      a        
2      a
3      a          db
3      a          db
3      b          db
3      c          db
;
run;
data want;
  set have;
  by id;
  array b{100000} $ _temporary_;
  array c{100000} $ _temporary_;
  n+1;
  b{n}=varb;
  c{n}=varc;
  if last.id then do;
    if varb='c' then do;
      do i=1 to n;
        varb=b{i};
        varc='c';
        output;
      end;
      n=0;
    end;
    else do;
      do i=1 to n;
        varb=b{i};
        varc=c{i};
        output;
      end;
      n=0;
    end;
  end;
run;
 

Ksharp

Message was edited by: xia keshan

bnarang
Calcite | Level 5

Thanks Arthur for the quick reply. Can you please take the above example and explain the hash merge? I am not aware of it. It would be of great help.

Thanks

Ksharp
Super User

Use the code; I think it is the fastest. And don't forget to check the result.

Haikuo
Onyx | Level 15

OK, I am not knocking anything here, just trying to offer some comments:

Of all the possible solutions for your scenario, at least two passes (reading the data twice) will inevitably be involved. The first pass has to go from your data storage (normally your hard drive) to your CPU; the only difference lies in the second pass, whether it is still storage-to-CPU or RAM-to-CPU, and obviously the latter is more efficient. However, there are other factors to consider in addition to performance, such as robustness, simplicity, and ease of programming.

For the most efficient approach, go with Ksharp's. However, even though the maximum number of observations per ID is unlikely to exceed 100k, a pure programming perfectionist (Art will laugh at this) may still not be completely comfortable with the hard-coded array size, and then you should go with Art's. Besides, if you have many more variables than just 'varb', Ksharp's solution will need multidimensional arrays as well as a bunch of hard coding.

Hash() comes in handy at first sight. It is more efficient than Art's DOW, while circumventing the hard coding that many more variables would require. However, it requires your whole dataset to fit into RAM at once, which could be a problem if you have a huge dataset and limited RAM.

Also note that, performance-wise, the edge of array() or hash() only becomes prominent if your dataset has lots of variables (a wide table) and lots of records (a long table).

In general, I would stay with the data step, even though Proc SQL also offers a solution.

 

data have;
  infile cards truncover;
  input ID  (varb varc) ($);
  cards;
1       a          db
1       b          db
1      c           db
2      a
2      a
3      a          db
3      a          db
3      b          db
3      c          db
;

/*hash*/
data want;
  if _n_=1 then do;
    if 0 then set have;
    length _varc $8;
    declare hash h(dataset:'have', multidata:'y');
    h.definekey('id');
    h.definedata(all:'y');
    h.definedone();
  end;
  set have;
  by id;
  if last.id then do;
    if varb='c' then do;
      _varc='c';
      _rc=h.find();
      do _rc=0 by 0 while (_rc=0);
        varc=_varc;
        output;
        _rc=h.find_next();
      end;
    end;
    else do;
      _rc=h.find();
      do _rc=0 by 0 while (_rc=0);
        output;
        _rc=h.find_next();
      end;
    end;
  end;
  drop _:;
run;

/*sql*/
proc sql noprint;
  create table want as
    select a.id, a.varb,
           (case when b.varb='c' then 'c' else a.varc end) as varc
      from have a
      left join
           (select monotonic() as sn, * from have group by id having sn=max(sn)) b
        on a.id=b.id;
quit;

Haikuo

joehinson
Calcite | Level 5

Hello Bhpinder,

Certainly, the temporary-array approach is the fastest. However, as you well know, every approach has its limitations, more or less.

For regular arrays, you need to know the size in advance, and you dare not mix variable types (character and numeric).

Hashes ("associative arrays") have their limitations too, aside from being memory hogs and tending to take more programming lines at times.

The problem is simply a case of group processing, plus the ability to see the data as a matrix. Thus, to some extent, hashes fit the bill.

Here is a one-pass hash solution. (I added a sequencing variable to the input data, just to preserve the original order!)

data have;
  infile datalines truncover;
  input ID  (varb varc) ($);
  seq=_N_;
  datalines;
1       a       db
1       b       db
1      c        db
2      a
2      a
3      a       db
3      a       db
3      b       db
3      c       db
4      b       db
5      b       db
5      b       db
5      a       db
6      b
6      b       db
6      c       db
6      a
6      b
6      b       db
6      c
;

 

data want(keep=seq ID varb varc);
  length swap 8;
  if (1=2) then set have;
  if _N_=1 then do;
    array sw (2) $ _temporary_;
    declare hash info(ordered:"a");
    info.defineKey("id");
    info.defineData("id","swap","ctag","hictag");
    info.defineDone();
    declare hiter hinfo("info");
    declare hash ctag;
    declare hiter hictag;
  end;

  do until(done);
    set have end=done;
    by id notsorted;
    if info.find() ne 0 then do;
      ctag=_new_ hash(ordered:"a",multidata:"y");
      hictag=_new_ hiter("ctag");
      ctag.defineKey("id");
      ctag.defineData("seq","id","varb","varc");
      ctag.defineDone();
      rca=info.replace();
    end;
    rcb=ctag.replace();
    swap=((last.id) * (varb="c"));
    info.replace();
  end;

  hinfo.first();
  do until (hinfo.next() ne 0);
    hictag.first();
    do until (hictag.next() ne 0);
      sw(1)=varb; sw(2)="c";
      varb=sw(swap+1);
      output;
      call missing(varb, varc);
    end;
  end;
  stop;
run;

Haikuo
Onyx | Level 15

Hi, I was expecting a HOH from you. Disappointed. ;-)

Haikuo

joehinson
Calcite | Level 5

Hello My Friend Haikuo,

How could you possibly fail to notice? (You need a new optician or ophthalmologist. LOL!)

It is indeed HOH (just a two-level nesting). ctag and hictag are objects within the info hash table:

declare hash info(ordered:"a");
info.defineKey("id");
info.defineData("id","swap","ctag","hictag");
info.defineDone();
declare hiter hinfo("info");
declare hash ctag;
declare hiter hictag;
     -----
     -----
if info.find() ne 0 then do;
  ctag=_new_ hash(ordered:"a",multidata:"y");
  hictag=_new_ hiter("ctag");
  ctag.defineKey("id");
  ctag.defineData("seq","id","varb","varc");
  ctag.defineDone();
  rca=info.replace();
end;
rcb=ctag.replace();


Haikuo
Onyx | Level 15

LOL, indeed.

Haikuo


Discussion stats
  • 9 replies
  • 1233 views
  • 6 likes
  • 5 in conversation