Lag function and data step processing

varunnakra · Posted 10-07-2012 03:14 PM

Hi,

I have the following dataset 'test' with a variable 'varname' having values - abc,' ' , ' ', ' ', ' ', xyz,' ' , ' ', ' ', ' '.

I need 'varname' to have values - abc,abc,abc,abc,xyz,xyz,xyz,xyz, i.e. get abc's and xyz's in the place of blanks.

I know that lag function doesnt give 'proper' results when executed conditionally, however, even though I am using it un-conditionally the code below doesn't give the desired output.

I can think of a logical error in data step passes, however, wanted to confirm the same here.

/* Data set creation */

data test;
input varname $ 10.;
datalines;
abc




xyz




;

/* Code to yield desired output */

data test1;
set test;
varname1 = lag(varname);
if varname = ' ' then varname = varname1;

run;

My thoughts -

The code above doesn't yield the desired output because the lag function works on the input line read from the test dataset in each pass and I am trying to change the value of the variable varname during the datastep, however, the lag function doesn't pick up the changed value because it works on the input value read from the test dataset. Even if I keep the same name of the output dataset, it doesn't work.

Please confirm..

varunnakra · Posted 10-07-2012 03:31 PM

Also, I have done this by transposing the dataset and using arrays. Is there any other way to do it ?

Linlin · Posted 10-07-2012 03:46 PM

you should use retain:

data test;

infile cards missover;

input id varname $ 10.;

cards;

1 abc

2

3

4 xyz

;

data want(drop=_:);

length _var $ 10;

set test;

retain _var ;

_var=ifc(varname='',_var,varname);

varname=_var;

proc print;run;

kuridisanjeev · Posted 10-08-2012 05:48 AM

Hi Try This..

data test;

infile cards missover;

input id region$ 10.;

cards;

1 abc

2

3

4 xyz

;

data want ;

set test;

retain _region;

drop _region;

if region = ' ' then region=_region;

else _region = region;

run;

Regards..

Sanjeev.K

Hentiedp · Posted 10-08-2012 11:19 AM

Hi,

Retain should be the best to use to get the desired results, as for the LAG function. LAG will return the previous value from a previous OBS and not automatically retain it. So if the previous value is missing, it will not matter since that was the previous value. For carrying values forward, my personal experience has been to be cautious when using LAG, especially in cases where there's missing data, it can make life very difficult.

Important thing to remember is to be carefull to execute it in a condition - results can be very unpredictable. (Think I read it somewhere) So let LAG execute once and then work from there, don't conditionally exclude it.

HTH

Hentie

Patrick · Posted 10-08-2012 11:36 AM

You are using the lag() function unconditional so no problems there, BUT: The lag function just gives you the value of how it was one iteration before - and if there was already a missing then the lag() will give you this missing.

Try:

data test(drop=_:);
input _v $ 10.;
length varname $10;
retain varname;
varname=coalescec(_v,varname);
datalines;
abc




xyz




;
run;

varunnakra · Posted 10-08-2012 02:46 PM

Thanks everyone for the inputs..

Conclusively, the LAG function works on the previous obs of the input dataset.. Since I update the value of varname during the datastep and this value is automatically retained but gets replaced by the new value from the input dataset. However, the change hasn't been done in the input dataset so the lag function considers missing to be the previous obs or value and not the changed value of varname. Going by this logic if I try changing the input dataset itself by keeping the same dataset names for data and set statements then the lag function should fetch the previous obs to be the changed value..However, even this doesnt happen.

So my question is how does the lag function keeps a track of input observations? Such that even when I try changing them during the data step (keeping the same dataset names in data and set statements), the change doesn't get reflected..?

PGStats · Posted 10-08-2012 03:10 PM

The LAG function is implemented as a simple FIFO (first in, first out) queue. LAG(x) puts the current value of x in storage and returns the value of x from the last time it was called. I find that it is not as useful as it seems at first. I use it only in last resort.

PG

Astounding · Posted 10-08-2012 03:13 PM

There are two basic concepts that don't work the way you expect.

First is the LAG function itself. It does not retrieve the value from the previous observation. It retrieves the value a variable had the last time that the LAG function executed. That's the reason that you have been warned about conditionally executing the LAG function. It must execute every time, in order to get the value from the previous observation.

Second is using the same name on the DATA and SET statements. That does not mean you are working with the same data for both input and output. The DATA step writes the newly created data (mentioned on the DATA statement) to blank disk space. When the DATA step has completed without error, it then assigns the newly created data set the proper name. If the same name has been used as appeared on the SET statement, then the old version of the data (before the DATA step began) is designated as reusable disk space.

hope this helps ...

art297 · Posted 10-08-2012 04:54 PM

NOTE: THE FOLLOWING DOES NOT WORK! I AM NOT DELETING THE POST AS ID DON'T WANT TO REMOVE PG'S CORRECT RESPONSE!!!!!

This post is more a request for confirmation from my colleagues than it is a suggestion.

I agree with Linlin's suggestion of using retain and with PG's and Astounding's explanations of the other concepts not working in the way that you expect.

However, Astounding's pointing out that your misconception about using the same file name twice got me thinking of yet another possible way of achieving your own lag.

Team: Can you see any flaw in the logic of the following approach?:

data test;

infile cards missover;

input id varname $ 10.;

cards;

1 abc

2

3

4 xyz

5

6 xxx

7

8

9 yyy

;

data want (drop=_:);

set test;

if _n_ gt 1 then _point=_n_-1;

else _point=1;

set want ( keep=varname rename=(varname=_dummy)) point=_point;

if missing(varname) then varname=_dummy;

run;

PGStats · Posted 10-08-2012 05:44 PM

set want point=_point doesn't refer to the dataset being built; at best, it will refer to the previous version of work.want. The first time you run this code, you will get :

ERROR: File WORK.WANT.DATA does not exist.

PG

art297 · Posted 10-08-2012 06:11 PM

PG: Absolutely correct! Knew it was too good to never having been mentioned as an alternative.

Conversely, is there a way of identifying the real temporary name of the file being created and, if so, can it be accessed during creation?

Astounding · Posted 10-08-2012 07:07 PM

I think the closest thing to what you're thinking of is a hash table, where _n_ (or its equivalent) is the key and the other variables are the look-up elements. As long as you don't run out of memory, just add to the hash table one observation at a time.

Lag function and data step processing

Re: Lag function and data step processing

Re: Lag function and data step processing

Re: Lag function and data step processing

Re: Lag function and data step processing

Re: Lag function and data step processing

Re: Lag function and data step processing

Re: Lag function and data step processing

Re: Lag function and data step processing

Re: Lag function and data step processing

Re: Lag function and data step processing

Re: Lag function and data step processing

Re: Lag function and data step processing

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away