Solved: Re: Populating values

bnarang · Posted 11-22-2012 01:50 PM

Hi All

My question is related to applying the logic. Suppose, I have a dataset which contains the time series information of some customers. So, i make certain check on each customers latest information. i.e each customer last row and if condition true the value is assigned. But I want to populate the same value to the above rows for the same customer. I give an example here:

This is what I have.

ID varb varc

1 a db

1 b db

1 c db

2 a

3 a db

3 b db

3 c db

So, I want to check each ID (1, 2, 3). If the value of last.Id eq 'c' then varc = 'c'. But it populates for that particular row. i want it to populate for the previous rows as well. How can it be done efficiently?

Thanks

art297 · Posted 11-22-2012 02:14 PM

Here is one way it could be done:

data have;

infile cards truncover;

input ID (varb varc) ($);

cards;

1 a db

1 b db

1 c db

2 a

3 a db

3 b db

3 c db

;

data want (drop=_:);

do until (last.id);

set have;

by id;

if last.id and varb eq 'c' then _hold='c';

end;

do until (last.id);

set have;

by id;

if not missing(_hold) then varc=_hold;

output;

end;

run;

View solution in original post

art297 · Posted 11-22-2012 02:14 PM

Here is one way it could be done:

data have;

infile cards truncover;

input ID (varb varc) ($);

cards;

1 a db

1 b db

1 c db

2 a

3 a db

3 b db

3 c db

;

data want (drop=_:);

do until (last.id);

set have;

by id;

if last.id and varb eq 'c' then _hold='c';

end;

do until (last.id);

set have;

by id;

if not missing(_hold) then varc=_hold;

output;

end;

run;

Ksharp · Posted 11-22-2012 09:17 PM

The fast way is Hash Table if you are familliar with it.

Another way is picking up these " last.Id eq 'c' " observations , and merge them back into original dataset by using MERGE statement.

At this time, I want take some idea from ArthurT, Array in dataset skill which is extremely fast.

data have;
  infile cards truncover;
  input ID  (varb varc) ($);
  cards;
1       a          db
1       b          db
1      c           db
2      a        
2      a
3      a          db
3      a          db
3      b          db
3      c          db
;
run;
data want;
 set have;
 by id;
 array b{100000} $ _temporary_;
 array c{100000} $ _temporary_;
 n+1;
 b{n}=varb;c{n}=varc;
 if last.id  then do;
                    if varb='c' then do;
                      do i=1 to n;
                        varb=b{i};
                        varc='c';
                        output;
                         n=0;
                       end;
                             end;
                     else  do;
                       do i=1 to n;
                         varb=b{i};
                         varc=c{i};
                         output;
                         n=0;
                        end;
                             end;
                    end;
run;

Ksharp

Message was edited by: xia keshan

bnarang · Posted 11-22-2012 09:31 PM

Thanks Arthur for a quick reply. , Can you please take the above example and explain about Hash Merge, i am not aware about it. It will be of great help.

Thanks

Ksharp · Posted 11-22-2012 09:36 PM

Using the code , I think it is fastest. And Don't forget to check the result.

Haikuo · Posted 11-23-2012 01:12 PM

Ok, I am not capping anything here, just trying to offer some comments:

From all of the solutions possible for your scenario, at least 2 passes (reading data twice) will inevitably be involved. First pass has to be your data storage (normally your hard drive) to your CPU, the only difference lies upon the second pass, whether it is still storage-CPU or RAM-CPU, obviously the latter is more efficient. However, there are other factors to be considered in addition to performance, such as robustness, simplicity and easiness of programming.

For the most efficient approach, go with Ksharp. However, even though the maximum obs per id is unlikely to exceed 100k, for a pure perfectionist of programming (Art will laugh at this), you may still not be completely comfortable, and then you should go with Art. Besides, if you have a lot more variables other than just one ‘varb’, Ksharp ‘s solution will have to engage multidimensional array as well as bunch of hard coding.

Hash() will come in handy at first sight. It is more efficient than Art’s DOW, while circumventing the possible hard coding when involving many more variables. However, it requires that your whole dataset to be fit into the RAM at once, which could be a problem if you have a huge dataset and a limited RAM.

Also, note that performance wise, the edge of array() or Hash() will only become prominent if your dataset has lots of variables( a wide table) and lots of records (a long table).

In general, I will stay with data step even though Proc SQL is offering a solution.

data have;

infile cards truncover;

input ID (varb varc) ($);

cards;

1 a db

1 b db

1 c db

2 a

3 a db

3 b db

3 c db

;

/*hash*/

data want;

if _n_=1 then do;

if 0 then set have;

length _varc $8.;

declare hash h(dataset:'have', multidata:'y');

h.definekey('id');

h.definedata(all:'y');

h.definedone();

end;

set have;

by id;

if last.id then do;

if varb='c' then do;

_varc='c';

_rc=h.find();

do _rc=0 by 0 while (_rc=0);

varc=_varc;

output;

_rc=h.find_next();

end;

else do;

_rc=h.find();

do _rc=0 by 0 while (_rc=0);

output;

_rc=h.find_next();

end;

drop _:;

run;

/*sql*/

proc sql noprint;

create table want as

select a.id, a.varb, (case when b.varb='c' then 'c' else a.varc end) as varc from

have a

left join

(select monotonic() as sn, * from have group by id having sn=max(sn))b

on a.id=b.id;

quit;

Haikuo

joehinson · Posted 11-23-2012 01:19 PM

Hello Bhpinder,

Certainly, the temporary array approach is the fastest. However, as you well know, every approach has its limitations, more or less.

For regular arrays, you need to know the size in advance; and you dare not mix variable types (characters and numeric).

Hashes ("associative arrays") have their limitations too, aside from being memory hogs and tending to take more programming lines at times.

The problem is simply a case of group processing as well as the ability to see the data as a matrix. Thus, to some extent, hashes can fit the bill.

Here is a one-pass-data hash solution: (I added a sequencing variable to the input data, just to preserve the original order)!

data have;

infile datalines truncover;

input ID (varb varc) ($);

seq=_N_;

datalines;

1 a db

1 b db

1 c db

2 a

3 a db

3 b db

3 c db

4 b db

5 b db

5 a db

6 b

6 b db

6 c db

6 a

6 b

6 b db

6 c

;

data want(keep=seq ID varb varc);

length swap 8;

if(1=2)then set have;

if _N_=1 then do;

array sw (2) $ _temporary_;

declare hash info(ordered:"a");

info.defineKey("id");

info.defineData("id","swap","ctag", "hictag");

info.defineDone();

declare hiter hinfo("info");

declare hash ctag;

declare hiter hictag;

end;

do until(done);

set have end=done;

by id notsorted;

if info.find() ne 0 then do;

ctag=_new_ hash(ordered:"a",multidata:"y");

hictag=_new_ hiter("ctag");

ctag.defineKey("id");

ctag.defineData("seq","id","varb","varc");

ctag.defineDone();

rca=info.replace();

end;

rcb=ctag.replace();

swap=((last.id) * (varb="c")) ;

info.replace();

end;

hinfo.first();

do until (hinfo.next() ne 0);

hictag.first();

do until (hictag.next() ne 0);

sw(1)=varb; sw(2)="c";

varb=sw(swap+1);

output;

call missing( varb, varc);

end;

stop;

run;

Haikuo · Posted 11-23-2012 02:04 PM

Hi , was expecting a HOH from you. Disappointed.

Haikuo

joehinson · Posted 11-23-2012 02:40 PM

Hello My Friend Haikuo,

How could you possibly fail to notice? (You need a new optician or opthamologist. LOL!)

It is indeed HOH (just a 2-level nesting). ctag and hictag are objects within the info hash table;

declare hash info(ordered:"a");

info.defineKey("id");

info.defineData("id","swap","ctag", "hictag");

info.defineDone();

declare hiter hinfo("info");

declare hash ctag;

declare hiter hictag;

-----

if info.find() ne 0 then do;

ctag=_new_ hash(ordered:"a",multidata:"y");

hictag=_new_ hiter("ctag");

ctag.defineKey("id");

ctag.defineData("seq","id","varb","varc");

ctag.defineDone();

rca=info.replace();

end;

rcb=ctag.replace();

Haikuo · Posted 11-23-2012 02:51 PM

LOL, indeed.

Haikuo

SAS Innovate 2025: Call for Content

Classroom Training Available!