Hi All
My question is related to applying the logic. Suppose, I have a dataset which contains the time series information of some customers. So, i make certain check on each customers latest information. i.e each customer last row and if condition true the value is assigned. But I want to populate the same value to the above rows for the same customer. I give an example here:
This is what I have.
ID varb varc
1 a db
1 b db
1 c db
2 a
2 a
3 a db
3 a db
3 b db
3 c db
So, I want to check each ID (1, 2, 3). If the value of last.Id eq 'c' then varc = 'c'. But it populates for that particular row. i want it to populate for the previous rows as well. How can it be done efficiently?
Thanks
Here is one way it could be done:
data have;
infile cards truncover;
input ID (varb varc) ($);
cards;
1 a db
1 b db
1 c db
2 a
2 a
3 a db
3 a db
3 b db
3 c db
;
data want (drop=_:);
do until (last.id);
set have;
by id;
if last.id and varb eq 'c' then _hold='c';
end;
do until (last.id);
set have;
by id;
if not missing(_hold) then varc=_hold;
output;
end;
run;
Here is one way it could be done:
data have;
infile cards truncover;
input ID (varb varc) ($);
cards;
1 a db
1 b db
1 c db
2 a
2 a
3 a db
3 a db
3 b db
3 c db
;
data want (drop=_:);
do until (last.id);
set have;
by id;
if last.id and varb eq 'c' then _hold='c';
end;
do until (last.id);
set have;
by id;
if not missing(_hold) then varc=_hold;
output;
end;
run;
The fast way is Hash Table if you are familliar with it.
Another way is picking up these " last.Id eq 'c' " observations , and merge them back into original dataset by using MERGE statement.
At this time, I want take some idea from ArthurT, Array in dataset skill which is extremely fast.
data have; infile cards truncover; input ID (varb varc) ($); cards; 1 a db 1 b db 1 c db 2 a 2 a 3 a db 3 a db 3 b db 3 c db ; run; data want; set have; by id; array b{100000} $ _temporary_; array c{100000} $ _temporary_; n+1; b{n}=varb;c{n}=varc; if last.id then do; if varb='c' then do; do i=1 to n; varb=b{i}; varc='c'; output; n=0; end; end; else do; do i=1 to n; varb=b{i}; varc=c{i}; output; n=0; end; end; end; run;
Ksharp
Message was edited by: xia keshan
Using the code , I think it is fastest. And Don't forget to check the result.
Ok, I am not capping anything here, just trying to offer some comments:
From all of the solutions possible for your scenario, at least 2 passes (reading data twice) will inevitably be involved. First pass has to be your data storage (normally your hard drive) to your CPU, the only difference lies upon the second pass, whether it is still storage-CPU or RAM-CPU, obviously the latter is more efficient. However, there are other factors to be considered in addition to performance, such as robustness, simplicity and easiness of programming.
For the most efficient approach, go with Ksharp. However, even though the maximum obs per id is unlikely to exceed 100k, for a pure perfectionist of programming (Art will laugh at this), you may still not be completely comfortable, and then you should go with Art. Besides, if you have a lot more variables other than just one ‘varb’, Ksharp ‘s solution will have to engage multidimensional array as well as bunch of hard coding.
Hash() will come in handy at first sight. It is more efficient than Art’s DOW, while circumventing the possible hard coding when involving many more variables. However, it requires that your whole dataset to be fit into the RAM at once, which could be a problem if you have a huge dataset and a limited RAM.
Also, note that performance wise, the edge of array() or Hash() will only become prominent if your dataset has lots of variables( a wide table) and lots of records (a long table).
In general, I will stay with data step even though Proc SQL is offering a solution.
data have;
infile cards truncover;
input ID (varb varc) ($);
cards;
1 a db
1 b db
1 c db
2 a
2 a
3 a db
3 a db
3 b db
3 c db
;
/*hash*/
data want;
if _n_=1 then do;
if 0 then set have;
length _varc $8.;
declare hash h(dataset:'have', multidata:'y');
h.definekey('id');
h.definedata(all:'y');
h.definedone();
end;
set have;
by id;
if last.id then do;
if varb='c' then do;
_varc='c';
_rc=h.find();
do _rc=0 by 0 while (_rc=0);
varc=_varc;
output;
_rc=h.find_next();
end;
end;
else do;
_rc=h.find();
do _rc=0 by 0 while (_rc=0);
output;
_rc=h.find_next();
end;
end;
end;
drop _:;
run;
/*sql*/
proc sql noprint;
create table want as
select a.id, a.varb, (case when b.varb='c' then 'c' else a.varc end) as varc from
have a
left join
(select monotonic() as sn, * from have group by id having sn=max(sn))b
on a.id=b.id;
quit;
Haikuo
Hello Bhpinder,
Certainly, the temporary array approach is the fastest. However, as you well know, every approach has its limitations, more or less.
For regular arrays, you need to know the size in advance; and you dare not mix variable types (characters and numeric).
Hashes ("associative arrays") have their limitations too, aside from being memory hogs and tending to take more programming lines at times.
The problem is simply a case of group processing as well as the ability to see the data as a matrix. Thus, to some extent, hashes can fit the bill.
Here is a one-pass-data hash solution: (I added a sequencing variable to the input data, just to preserve the original order)!
data have;
infile datalines truncover;
input ID (varb varc) ($);
seq=_N_;
datalines;
1 a db
1 b db
1 c db
2 a
2 a
3 a db
3 a db
3 b db
3 c db
4 b db
5 b db
5 b db
5 a db
6 b
6 b db
6 c db
6 a
6 b
6 b db
6 c
;
data want(keep=seq ID varb varc);
length swap 8;
if(1=2)then set have;
if _N_=1 then do;
array sw (2) $ _temporary_;
declare hash info(ordered:"a");
info.defineKey("id");
info.defineData("id","swap","ctag", "hictag");
info.defineDone();
declare hiter hinfo("info");
declare hash ctag;
declare hiter hictag;
end;
do until(done);
set have end=done;
by id notsorted;
if info.find() ne 0 then do;
ctag=_new_ hash(ordered:"a",multidata:"y");
hictag=_new_ hiter("ctag");
ctag.defineKey("id");
ctag.defineData("seq","id","varb","varc");
ctag.defineDone();
rca=info.replace();
end;
rcb=ctag.replace();
swap=((last.id) * (varb="c")) ;
info.replace();
end;
hinfo.first();
do until (hinfo.next() ne 0);
hictag.first();
do until (hictag.next() ne 0);
sw(1)=varb; sw(2)="c";
varb=sw(swap+1);
output;
call missing( varb, varc);
end;
end;
stop;
run;
Hello My Friend Haikuo,
How could you possibly fail to notice? (You need a new optician or opthamologist. LOL!)
It is indeed HOH (just a 2-level nesting). ctag and hictag are objects within the info hash table;
declare hash info(ordered:"a");
info.defineKey("id");
info.defineData("id","swap","ctag", "hictag");
info.defineDone();
declare hiter hinfo("info");
declare hash ctag;
declare hiter hictag;
-----
-----
if info.find() ne 0 then do;
ctag=_new_ hash(ordered:"a",multidata:"y");
hictag=_new_ hiter("ctag");
ctag.defineKey("id");
ctag.defineData("seq","id","varb","varc");
ctag.defineDone();
rca=info.replace();
end;
rcb=ctag.replace();
LOL, indeed.
Haikuo
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.