BookmarkSubscribeRSS Feed
varunnakra
Fluorite | Level 6

Hi,

I have the following dataset 'test' with a variable 'varname' having values - abc,' ' , ' ', ' ', ' ', xyz,' ' , ' ', ' ', ' '.

I need 'varname' to have values -   abc,abc,abc,abc,xyz,xyz,xyz,xyz, i.e. get abc's and xyz's in the place of blanks.

I know that lag function doesnt give 'proper' results when executed conditionally, however, even though I am using it un-conditionally the code below doesn't give the desired output.

I can think of a logical error in data step passes, however, wanted to confirm the same here.

/* Data set creation */

data test;
input varname $ 10.;
datalines;
abc
  
  
  
  
xyz
  
  
  
  
;

/* Code to yield desired output */

data test1;
set test;
varname1 = lag(varname);
if varname = ' ' then varname = varname1;

run;

My thoughts -

The code above doesn't yield the desired output because the lag function works on the input line read from the test dataset in each pass and I am trying to change the value of the variable varname during the datastep, however, the lag function doesn't pick up the changed value because it works on the input value read from the test dataset. Even if I keep the same name of the output dataset, it doesn't work.

Please confirm..

12 REPLIES 12
varunnakra
Fluorite | Level 6

Also, I have done this by transposing the dataset and using arrays. Is there any other way to do it ?

Linlin
Lapis Lazuli | Level 10

you should use retain:

data test;

infile cards missover;

input id varname $ 10.;

cards;

1 abc

2

3

4 xyz

;

data want(drop=_:);

length _var $ 10;

  set test;

  retain _var ;

  _var=ifc(varname='',_var,varname);

varname=_var;

proc print;run;

kuridisanjeev
Quartz | Level 8

Hi Try This..

data test;

infile cards missover;

input id region$ 10.;

cards;

1 abc

2

3

4 xyz

;

data want ;

   set test;

   retain _region;

   drop _region;

   if region = ' ' then region=_region;

   else _region = region;

run;

Regards..

Sanjeev.K

Hentiedp
Fluorite | Level 6

Hi,

Retain should be the best to use to get the desired results, as for the LAG function. LAG will return the previous value from a previous OBS and not automatically retain it. So if the previous value is missing, it will not matter since that was the previous value. For carrying values forward, my personal experience has been to be cautious when using LAG, especially in cases where there's missing data, it can make life very difficult.

Important thing to remember is to be carefull to execute it in a condition - results can be very unpredictable.  (Think I read it somewhere) So let LAG  execute once and then work from there, don't conditionally exclude it.

HTH

Hentie

Patrick
Opal | Level 21

You are using the lag() function unconditional so no problems there, BUT: The lag function just gives you the value of how it was one iteration before - and if there was already a missing then the lag() will give you this missing.

Try:

data test(drop=_:);
input _v $ 10.;
length varname $10;
retain varname;
varname=coalescec(_v,varname);
datalines;
abc
  
  
  
  
xyz
  
  
  
  
;
run;

varunnakra
Fluorite | Level 6

Thanks everyone for the inputs..

Conclusively, the LAG function works on the previous obs of the input dataset.. Since I update the value of varname during the datastep and this value is automatically retained but gets replaced by the new value from the input dataset. However, the change hasn't been done in the input dataset so the lag function considers missing to be the previous obs or value and not the changed value of varname. Going by this logic if I try changing the input dataset itself by keeping the same dataset names for data and set statements then the lag function should fetch the previous obs to be the changed value..However, even this doesnt happen.

So my question is how does the lag function keeps a track of input observations? Such that even when I try changing them during the data step (keeping the same dataset names in data and set statements), the change doesn't get reflected..?

PGStats
Opal | Level 21

The LAG function is implemented as a simple FIFO (first in, first out) queue. LAG(x) puts the current value of x in storage and returns the value of x from the last time it was called.  I find that it is not as useful as it seems at first. I use it only in last resort.

PG

PG
Astounding
PROC Star

There are two basic concepts that don't work the way you expect.

First is the LAG function itself.  It does not retrieve the value from the previous observation.  It retrieves the value a variable had the last time that the LAG function executed.  That's the reason that you have been warned about conditionally executing the LAG function.  It must execute every time, in order to get the value from the previous observation.

Second is using the same name on the DATA and SET statements.  That does not mean you are working with the same data for both input and output.  The DATA step writes the newly created data (mentioned on the DATA statement) to blank disk space.  When the DATA step has completed without error, it then assigns the newly created data set the proper name.  If the same name has been used as appeared on the SET statement, then the old version of the data (before the DATA step began) is designated as reusable disk space.

hope this helps ...

art297
Opal | Level 21

NOTE:  THE FOLLOWING DOES NOT WORK!  I AM NOT DELETING THE POST AS ID DON'T WANT TO REMOVE PG'S CORRECT RESPONSE!!!!!

This post is more a request for confirmation from my colleagues than it is a suggestion.

I agree with Linlin's suggestion of using retain and with PG's and Astounding's explanations of the other concepts not working in the way that you expect.

However, Astounding's pointing out that your misconception about using the same file name twice got me thinking of yet another possible way of achieving your own lag.

Team: Can you see any flaw in the logic of the following approach?:

data test;

  infile cards missover;

  input id varname $ 10.;

cards;

1 abc

2

3

4 xyz

5

6 xxx

7

8

9 yyy

;

data want (drop=_:);

  set test;

  if _n_ gt 1 then _point=_n_-1;

  else _point=1;

  set want (  keep=varname rename=(varname=_dummy)) point=_point;

  if missing(varname) then varname=_dummy;

run;

PGStats
Opal | Level 21

set want point=_point doesn't refer to the dataset being built; at best, it will refer to the previous version of work.want. The first time you run this code, you will get :

ERROR: File WORK.WANT.DATA does not exist.

PG

PG
art297
Opal | Level 21

PG: Absolutely correct!  Knew it was too good to never having been mentioned as an alternative.

Conversely, is there a way of identifying the real temporary name of the file being created and, if so, can it be accessed during creation?

Astounding
PROC Star

I think the closest thing to what you're thinking of is a hash table, where _n_ (or its equivalent) is the key and the other variables are the look-up elements.  As long as you don't run out of memory, just add to the hash table one observation at a time.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1985 views
  • 6 likes
  • 8 in conversation