Hi,
I have the following dataset 'test' with a variable 'varname' having values - abc,' ' , ' ', ' ', ' ', xyz,' ' , ' ', ' ', ' '.
I need 'varname' to have values - abc,abc,abc,abc,xyz,xyz,xyz,xyz, i.e. get abc's and xyz's in the place of blanks.
I know that lag function doesnt give 'proper' results when executed conditionally, however, even though I am using it un-conditionally the code below doesn't give the desired output.
I can think of a logical error in data step passes, however, wanted to confirm the same here.
/* Data set creation */
data test;
input varname $ 10.;
datalines;
abc
xyz
;
/* Code to yield desired output */
data test1;
set test;
varname1 = lag(varname);
if varname = ' ' then varname = varname1;
run;
My thoughts -
The code above doesn't yield the desired output because the lag function works on the input line read from the test dataset in each pass and I am trying to change the value of the variable varname during the datastep, however, the lag function doesn't pick up the changed value because it works on the input value read from the test dataset. Even if I keep the same name of the output dataset, it doesn't work.
Please confirm..
Also, I have done this by transposing the dataset and using arrays. Is there any other way to do it ?
you should use retain:
data test;
infile cards missover;
input id varname $ 10.;
cards;
1 abc
2
3
4 xyz
;
data want(drop=_:);
length _var $ 10;
set test;
retain _var ;
_var=ifc(varname='',_var,varname);
varname=_var;
proc print;run;
Hi Try This..
data test;
infile cards missover;
input id region$ 10.;
cards;
1 abc
2
3
4 xyz
;
data want ;
set test;
retain _region;
drop _region;
if region = ' ' then region=_region;
else _region = region;
run;
Regards..
Sanjeev.K
Hi,
Retain should be the best to use to get the desired results, as for the LAG function. LAG will return the previous value from a previous OBS and not automatically retain it. So if the previous value is missing, it will not matter since that was the previous value. For carrying values forward, my personal experience has been to be cautious when using LAG, especially in cases where there's missing data, it can make life very difficult.
Important thing to remember is to be carefull to execute it in a condition - results can be very unpredictable. (Think I read it somewhere) So let LAG execute once and then work from there, don't conditionally exclude it.
HTH
Hentie
You are using the lag() function unconditional so no problems there, BUT: The lag function just gives you the value of how it was one iteration before - and if there was already a missing then the lag() will give you this missing.
Try:
data test(drop=_:);
input _v $ 10.;
length varname $10;
retain varname;
varname=coalescec(_v,varname);
datalines;
abc
xyz
;
run;
Thanks everyone for the inputs..
Conclusively, the LAG function works on the previous obs of the input dataset.. Since I update the value of varname during the datastep and this value is automatically retained but gets replaced by the new value from the input dataset. However, the change hasn't been done in the input dataset so the lag function considers missing to be the previous obs or value and not the changed value of varname. Going by this logic if I try changing the input dataset itself by keeping the same dataset names for data and set statements then the lag function should fetch the previous obs to be the changed value..However, even this doesnt happen.
So my question is how does the lag function keeps a track of input observations? Such that even when I try changing them during the data step (keeping the same dataset names in data and set statements), the change doesn't get reflected..?
The LAG function is implemented as a simple FIFO (first in, first out) queue. LAG(x) puts the current value of x in storage and returns the value of x from the last time it was called. I find that it is not as useful as it seems at first. I use it only in last resort.
PG
There are two basic concepts that don't work the way you expect.
First is the LAG function itself. It does not retrieve the value from the previous observation. It retrieves the value a variable had the last time that the LAG function executed. That's the reason that you have been warned about conditionally executing the LAG function. It must execute every time, in order to get the value from the previous observation.
Second is using the same name on the DATA and SET statements. That does not mean you are working with the same data for both input and output. The DATA step writes the newly created data (mentioned on the DATA statement) to blank disk space. When the DATA step has completed without error, it then assigns the newly created data set the proper name. If the same name has been used as appeared on the SET statement, then the old version of the data (before the DATA step began) is designated as reusable disk space.
hope this helps ...
NOTE: THE FOLLOWING DOES NOT WORK! I AM NOT DELETING THE POST AS ID DON'T WANT TO REMOVE PG'S CORRECT RESPONSE!!!!!
This post is more a request for confirmation from my colleagues than it is a suggestion.
I agree with Linlin's suggestion of using retain and with PG's and Astounding's explanations of the other concepts not working in the way that you expect.
However, Astounding's pointing out that your misconception about using the same file name twice got me thinking of yet another possible way of achieving your own lag.
Team: Can you see any flaw in the logic of the following approach?:
data test;
infile cards missover;
input id varname $ 10.;
cards;
1 abc
2
3
4 xyz
5
6 xxx
7
8
9 yyy
;
data want (drop=_:);
set test;
if _n_ gt 1 then _point=_n_-1;
else _point=1;
set want ( keep=varname rename=(varname=_dummy)) point=_point;
if missing(varname) then varname=_dummy;
run;
set want point=_point doesn't refer to the dataset being built; at best, it will refer to the previous version of work.want. The first time you run this code, you will get :
ERROR: File WORK.WANT.DATA does not exist.
PG
PG: Absolutely correct! Knew it was too good to never having been mentioned as an alternative.
Conversely, is there a way of identifying the real temporary name of the file being created and, if so, can it be accessed during creation?
I think the closest thing to what you're thinking of is a hash table, where _n_ (or its equivalent) is the key and the other variables are the look-up elements. As long as you don't run out of memory, just add to the hash table one observation at a time.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.