DATA Step, Macro, Functions and more

Populating values

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 86
Accepted Solution

Populating values

Hi All

My question is related to applying the logic. Suppose, I have a dataset which contains the time series information of some customers. So, i make certain check on each customers latest information. i.e each customer last row and if condition true the value is assigned. But I want to populate the same value to the above rows for the same customer. I give an example here:

This is what I have.

ID      varb     varc

1       a          db

1       b          db

1      c           db

2      a         

2      a

3      a          db

3      a          db

3      b          db

3      c          db

So, I want to check each ID (1, 2, 3). If the value of last.Id eq 'c' then varc = 'c'. But it populates for that particular row. i want it to populate for the previous rows as well. How can it be done efficiently?

Thanks


Accepted Solutions
Solution
‎11-22-2012 02:14 PM
PROC Star
Posts: 7,360

Re: Populating values

Here is one way it could be done:

data have;

  infile cards truncover;

  input ID  (varb varc) ($);

  cards;

1       a          db

1       b          db

1      c           db

2      a        

2      a

3      a          db

3      a          db

3      b          db

3      c          db

;

data want (drop=_Smiley Happy;

  do until (last.id);

    set have;

    by id;

    if last.id and varb eq 'c' then _hold='c';

  end;

  do until (last.id);

    set have;

    by id;

    if not missing(_hold) then varc=_hold;

    output;

  end;

run;

View solution in original post


All Replies
Solution
‎11-22-2012 02:14 PM
PROC Star
Posts: 7,360

Re: Populating values

Here is one way it could be done:

data have;

  infile cards truncover;

  input ID  (varb varc) ($);

  cards;

1       a          db

1       b          db

1      c           db

2      a        

2      a

3      a          db

3      a          db

3      b          db

3      c          db

;

data want (drop=_Smiley Happy;

  do until (last.id);

    set have;

    by id;

    if last.id and varb eq 'c' then _hold='c';

  end;

  do until (last.id);

    set have;

    by id;

    if not missing(_hold) then varc=_hold;

    output;

  end;

run;

Super User
Posts: 9,673

Re: Populating values

The fast way is Hash Table if you are familliar with it.

Another way is picking up these "  last.Id eq 'c'  " observations , and merge them back into original dataset  by using MERGE statement.

At this time, I want take some idea from ArthurT, Array in dataset skill which is extremely fast.

data have;
  infile cards truncover;
  input ID  (varb varc) ($);
  cards;
1       a          db
1       b          db
1      c           db
2      a        
2      a
3      a          db
3      a          db
3      b          db
3      c          db
;
run;
data want;
 set have;
 by id;
 array b{100000} $ _temporary_;
 array c{100000} $ _temporary_;
 n+1;
 b{n}=varb;c{n}=varc;
 if last.id  then do;
                    if varb='c' then do;
                      do i=1 to n;
                        varb=b{i};
                        varc='c';
                        output;
                         n=0;
                       end;
                             end;
                     else  do;
                       do i=1 to n;
                         varb=b{i};
                         varc=c{i};
                         output;
                         n=0;
                        end;
                             end;
                    end;
run;
 

Ksharp

Message was edited by: xia keshan

Frequent Contributor
Posts: 86

Re: Populating values

Thanks Arthur for a quick reply. , Can you please take the above example and explain about Hash Merge, i am not aware about it. It will be of great help.

Thanks

Super User
Posts: 9,673

Re: Populating values

Using the code , I think it is fastest. And Don't forget to check the result.

Respected Advisor
Posts: 3,124

Re: Populating values

Ok, I am not capping anything here, just trying to offer some comments:

From all of the solutions possible for your scenario, at least 2 passes (reading data twice) will inevitably be involved. First pass has to be your data storage (normally your hard drive) to your CPU, the only difference lies upon the second pass, whether it is still storage-CPU or RAM-CPU, obviously the latter is more efficient. However, there are other factors to be considered in addition to performance, such as robustness, simplicity and easiness of programming.

For the most efficient approach, go with Ksharp. However, even though the maximum obs per id is unlikely to exceed 100k, for a pure perfectionist of programming (Art will laugh at this), you may still not be completely comfortable, and then you should go with Art. Besides, if you have a lot more variables other than just one ‘varb’, Ksharp ‘s solution will have to engage multidimensional array as well as bunch of hard coding.

Hash() will come in handy at first sight. It is more efficient than Art’s DOW, while circumventing the possible hard coding when involving many more variables. However, it requires that your whole dataset to be fit into the RAM at once, which could be a problem if you have a huge dataset and a limited RAM.

Also, note that performance wise, the edge of array() or Hash() will only become prominent if your dataset has lots of variables( a wide table) and lots of records (a long table).

In general, I will stay with data step even though Proc SQL is offering a solution.

 

data have;

  infile cards truncover;

  input ID  (varb varc) ($);

  cards;

1       a          db

1       b          db

1      c db

2      a

2      a

3      a db

3      a db

3      b db

3      c db

;

/*hash*/

data want;

   if _n_=1 then do;

     if 0 then set have;

       length _varc $8.;

       declare hash h(dataset:'have', multidata:'y');

     h.definekey('id');

       h.definedata(all:'y');

       h.definedone();

      end;

      set have;

        by id;

        if last.id then do;

if varb='c' then do;

_varc='c';

            _rc=h.find();

            do _rc=0 by 0 while (_rc=0);

                     varc=_varc;

                     output;

                     _rc=h.find_next();

                  end;

end;

             else do;

_rc=h.find();

do _rc=0 by 0 while (_rc=0);

                     output;

                     _rc=h.find_next();

                   end;

             end;

        end;

drop _:;

run;

/*sql*/

proc sql noprint;

  create table want as

    select a.id, a.varb, (case when b.varb='c' then 'c' else a.varc end) as varc from

      have a

      left join

      (select monotonic() as sn, * from have group by id having sn=max(sn))b

      on a.id=b.id;

quit;

Haikuo

Contributor
Posts: 45

Re: Populating values

Hello Bhpinder,

Certainly, the temporary array approach is the fastest. However, as you well know, every approach has its limitations, more or less.

For regular arrays, you need to know the size in advance; and you dare not mix variable types (characters and numeric).

Hashes ("associative arrays") have their limitations too, aside from being memory hogs and tending to take more programming lines at times.

The problem is simply a case of group processing as well as the ability to see the data as a matrix. Thus, to some extent, hashes can fit the bill.

Here is a one-pass-data hash solution: (I added a sequencing variable to the input data, just to preserve the original order)!

data have;

 

  infile datalines truncover; 

  input ID  (varb varc) ($);

  seq=_N_;

  datalines;

1       a       db

1       b       db

1      c        db

2      a       

2      a

3      a       db

3      a       db

3      b       db

3      c       db

4      b       db

5      b       db

5      b       db

5      a       db

6      b         

6      b       db

6      c       db

6      a         

6      b        

6      b       db

6      c        

;

 

data want(keep=seq ID varb varc);

     length swap 8;

     if(1=2)then set have;

     if _N_=1 then do; 

          array sw (2) $ _temporary_;

          declare hash info(ordered:"a");

          info.defineKey("id");

          info.defineData("id","swap","ctag", "hictag");

          info.defineDone();

          declare hiter hinfo("info");

          declare hash ctag;

          declare hiter hictag;

     end;

 

     do until(done);

           set have end=done;

           by id notsorted;

           if info.find() ne 0 then do;

                ctag=_new_ hash(ordered:"a",multidata:"y");

                hictag=_new_ hiter("ctag");

                ctag.defineKey("id");

                ctag.defineData("seq","id","varb","varc");

                ctag.defineDone();

                rca=info.replace();

           end;

                     rcb=ctag.replace();


          swap=((last.id) * (varb="c")) ;

          info.replace();

     end;

 

     hinfo.first();

     do until (hinfo.next() ne 0);

           hictag.first();

           do until (hictag.next() ne 0);

                sw(1)=varb; sw(2)="c";

                varb=sw(swap+1);

                output;

                call missing( varb, varc);

           end;

     end;

stop;

run;

Respected Advisor
Posts: 3,124

Re: Populating values

Hi , was expecting a HOH from you. Disappointed. Smiley Wink

Haikuo

Contributor
Posts: 45

Re: Populating values

Hello My Friend Haikuo,

How could you possibly fail to notice? (You need a new optician or opthamologist. LOL!)

It is indeed HOH (just a 2-level nesting). ctag and hictag are objects within the info hash table;

declare hash info(ordered:"a");

          info.defineKey("id");

          info.defineData("id","swap","ctag", "hictag");

          info.defineDone();

          declare hiter hinfo("info");

          declare hash ctag;

          declare hiter hictag;

                    -----

                    -----       

if info.find() ne 0 then do;

                ctag=_new_ hash(ordered:"a",multidata:"y");

                hictag=_new_ hiter("ctag");

                ctag.defineKey("id");

                ctag.defineData("seq","id","varb","varc");

                ctag.defineDone();

                rca=info.replace();

           end;

                     rcb=ctag.replace();


Respected Advisor
Posts: 3,124

Re: Populating values

LOL, indeed.

Haikuo

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 348 views
  • 6 likes
  • 5 in conversation