BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
AnandsinhSindha
Obsidian | Level 7

Need help in this data step 

 

i have data sample...

Every year two bill generate for each id,and if year is missing i need to add year with missing amount.

 

data have

ID   Year  Amount

1   2002   100

1   2004    200

1   2004    250

1   2010    300

1   2010    320

1   2011  400

1   2011  400

:    :         :

:    :         :

1  2019  460

1  2019  460

 

Above is my sample dataset and i want to generate below output.

 

data output

 

ID YEAR  Amount

1   2002   100  

1   2002   .

1   2003   .

1   2003   .

1   2004    250

1   2004    .

1   2005    .

1   2005    .

1   2006    .

1   2006    .

1   2007    .

1   2007    .

1   2008    .

1   2008    .

1   2009    .

1   2009    .

1   2010    300

1   2010    320

1   2011    400

1   2011    400

:    :          :

:    :          :

1  2019   460

1  2019   460

 

i am using sas enterprise guide.8.1

 

can someone suggest me a way how can i do this.

 

thank you.

Anandsinh

 

1 ACCEPTED SOLUTION

Accepted Solutions
RichardDeVen
Barite | Level 11

A judicious use use dual SET statements and LAG function can be the basis of a short DATA Step program.

Edit: The expression NOT FIRST.ID is used to prevent gap looping between ID's (Edit replaced _N_>1).

 

Example:

 

data have; input
ID Year Amount; datalines;
1 2002 100
1 2004 200
1 2004 250
1 2010 300
1 2010 320
1 2011 400
1 2011 400
1 2019 460
1 2019 460
;

data want;
  set have;

/* loop over gap years filling in two observations each with missing amounts */
do year = sum(lag(year),1) to year-1 while (NOT FIRST.ID); call missing(amount); OUTPUT; OUTPUT; end;
/* reread data in sync with first SET in order to
* re-retrieve AMOUNT that may have been overwritten within gap loop
*/
set have; by id year; OUTPUT;
if first.year and last.year then do; /* check if id/year has only one row */ call missing(amount); OUTPUT; /* output missing AMOUNT for second row of id/year */ end; run;

 

An alternative to dual SET would use temporary variables (hold_*) to track values that would get munged by the gap loop.

data want(drop=hold:);
  set have;
  by id year;

  hold_year = year;
  hold_amount = amount;
  call missing (amount);

  do year = sum(lag(year),1) to year-1 while (NOT FIRST.ID);
    OUTPUT;
    OUTPUT;
  end;

  year = hold_year;
  amount = hold_amount;

  OUTPUT;
  if first.year and last.year then do;
    call missing(amount);
    OUTPUT;
  end;
run;

 

 

View solution in original post

11 REPLIES 11
PeterClemmensen
Tourmaline | Level 20

Why is Amount=. for the second obs of Year=2004?

AnandsinhSindha
Obsidian | Level 7

2 bill generated for each year...

 

if bill not generated two times per year, i need to generate with missing amount.

 

that's why amount =. in second obs of 2004 in output dataset

PeterClemmensen
Tourmaline | Level 20

But there are two obs in the input data set with year=2004 and Amount=200 and 250 respectively? 

AnandsinhSindha
Obsidian | Level 7

Sorry its my mistake in output data set...

 

 

AnandsinhSindha
Obsidian | Level 7

The output dataset should be like this....

 

data output

 

ID YEAR  Amount

1   2002   100  

1   2002   .

1   2003   .

1   2003   .

1   2004    200

1   2004    250

1   2005    .

1   2005    .

1   2006    .

1   2006    .

1   2007    .

1   2007    .

1   2008    .

1   2008    .

1   2009    .

1   2009    .

1   2010    300

1   2010    320

1   2011    400

1   2011    400

:    :          :

:    :          :

1  2019   460

1  2019   460

PeterClemmensen
Tourmaline | Level 20

Try this. I added another ID to the example data with another Year Range. I assume you want to be able to handle multiple by-groups.

 

data have;
input ID Year Amount;
datalines;
1 2002 100
1 2004 200
1 2004 250
1 2010 300
1 2010 320
1 2011 400
1 2011 400
1 2019 460
1 2019 460
2 2001 100
2 2004 200
2 2004 250
2 2010 300
2 2010 320
2 2011 400
2 2011 400
2 2018 460
2 2018 460
;

data want (drop=i s e);
   if _N_ = 1 then do;
      dcl hash h(multidata : "Y", ordered : "Y");
      h.definekey("Year");
      h.definedata("Year", "Amount");
      h.definedone();
      dcl hiter hi("h");
   end;

   do until (last.ID);
      set have;
      by ID;
      h.add();
   end;

   _N_ = hi.first(); s = Year;
   _N_ = hi.last();  e = Year;
   _N_ = hi.next();

   do Year = s to e;
      do i = 0 by 1 while (h.do_over() = 0);
         output;
      end;
      if i = 0 then do;
         Amount = .;
         output; output;
      end;
      if i = 1 then do;
         Amount = .;
         output;
      end;
   end;

   _N_ = h.clear();

run;
DaveShea
Lapis Lazuli | Level 10

Hi Anandsinh,

Try this idea. It has several steps, but it has the advantage of slowing down what is going on, so that you can understand how it works and decide whether it suits your needs. It also has lot of comments and so appears bigger than it really is.

I'm going to pretend for a moment that you really do want every row to have ID=1 🤔

 

**************************************************************;
*Create the original dataset supplied by Anandsinh            ;
**************************************************************;
Data HAVE;
 Input  ID
        Year
        Amount
        ;
DATALINES;
1 2002 100
1 2004 200
1 2004 250
1 2010 300
1 2010 320
1 2011 400
1 2011 400
1 2019 460
1 2019 460
;
Run;

**************************************************************;
*Sort the dataset so that we can assign the Sequence number   ;
*we need, temporarily.                                        ;
**************************************************************;
Proc Sort Data=HAVE;
 By Year;
Run;

*******************************************************************;
*If we are on the first row for a given Year, assign Sequence=1    ;
*If we are on the last row for a given year and it is also NOT the ;
*first row for that year, assign Sequence=2                        ;
*******************************************************************;
Data HAVE;
 Set HAVE;
 By Year;

 If First.Year Then Sequence=1;
 If Last.Year  And NOT First.Year Then Sequence=2;

Run;

**************************************************************;
*Figure out what range of years we currently have.            ;
**************************************************************;
Proc SQL NoPrint;
 Create Table HAVE_YEARS As
 Select Min(Year) As Min_Year,
        Max(Year) As Max_Year
 From   HAVE
;
Quit;

**************************************************************;
*Create an empty dataset of just two rows for each year in    ;
*our range.                                                   ;
**************************************************************;
Data ALL_YEARS;
 Set HAVE_YEARS;

 Do i=Min_Year To Max_Year;
    ID=1;
    Year=i;
    Amount=.;
    **************************************************************;
    *Output each row TWICE, each time with a different Sequence.  ;
    **************************************************************;
    Sequence=1;
    Output;
    Sequence=2;
    Output;
 End;

Keep    ID
        Year
        Amount
        Sequence
        ;
Run;

**************************************************************;
*Sort the datasets prior to merging them together.            ;
**************************************************************;
Proc Sort Data=HAVE;
 By Year
    Sequence
    ;
Run;

Proc Sort Data=ALL_YEARS;
 By Year
    Sequence
    ;
Run;

**************************************************************;
*Merge the original data with the empty frame of the ALL_YEARS;
*dataset.                                                     ;
**************************************************************;
Data WANT;
 Merge ALL_YEARS
       HAVE
       ;
       
 By Year
    Sequence
    ;

**************************************************************;
*Optionally DROP the Sequence variable.                       ;
**************************************************************;
Drop Sequence
     ;

Run;

Good luck, and ask if anything is not clear.

 

Downunder Dave
Wellington
New Zealand

andreas_lds
Jade | Level 19

And here is another solution using data step and proc sort. Maybe i missed some crucial information and the code is to simple:

 

data want_simple;
   set have;
   by Id Year;

   retain expectedYear;

   if first.Id then do;
      expectedYear = year;
   end;

   output;

   if first.Year and last.Year then do;
      amount = .;
      output;
   end;

   if last.Year then do;
      if expectedYear < year then do;
         do year = expectedYear to year-1;
            amount = .;
            output;
            output;
         end;
      end;
      expectedYear = year +1;
   end;

   drop expectedYear;
run;

proc sort data=work.want_simple;
   by id Year;
run;
RichardDeVen
Barite | Level 11

A judicious use use dual SET statements and LAG function can be the basis of a short DATA Step program.

Edit: The expression NOT FIRST.ID is used to prevent gap looping between ID's (Edit replaced _N_>1).

 

Example:

 

data have; input
ID Year Amount; datalines;
1 2002 100
1 2004 200
1 2004 250
1 2010 300
1 2010 320
1 2011 400
1 2011 400
1 2019 460
1 2019 460
;

data want;
  set have;

/* loop over gap years filling in two observations each with missing amounts */
do year = sum(lag(year),1) to year-1 while (NOT FIRST.ID); call missing(amount); OUTPUT; OUTPUT; end;
/* reread data in sync with first SET in order to
* re-retrieve AMOUNT that may have been overwritten within gap loop
*/
set have; by id year; OUTPUT;
if first.year and last.year then do; /* check if id/year has only one row */ call missing(amount); OUTPUT; /* output missing AMOUNT for second row of id/year */ end; run;

 

An alternative to dual SET would use temporary variables (hold_*) to track values that would get munged by the gap loop.

data want(drop=hold:);
  set have;
  by id year;

  hold_year = year;
  hold_amount = amount;
  call missing (amount);

  do year = sum(lag(year),1) to year-1 while (NOT FIRST.ID);
    OUTPUT;
    OUTPUT;
  end;

  year = hold_year;
  amount = hold_amount;

  OUTPUT;
  if first.year and last.year then do;
    call missing(amount);
    OUTPUT;
  end;
run;

 

 

Ksharp
Super User
data have;
input ID Year Amount;
datalines;
1 2002 100
1 2004 200
1 2004 250
1 2010 300
1 2010 320
1 2011 400
1 2011 400
1 2019 460
1 2019 460
2 2001 100
2 2004 200
2 2004 250
2 2010 300
2 2010 320
2 2011 400
2 2011 400
2 2018 460
2 2018 460
;
proc sql;
create table temp as
select id,min(year) as min,max(year) as max from have group by id;
quit;
data temp1;
 set temp;
 do year=min to max;
   output;output;
 end;
keep id year;
run;
data want;
 merge temp1 have;
 by id year;
 output;
 call missing(amount);
run;
AnandsinhSindha
Obsidian | Level 7

Thank you so much for your quick response

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1344 views
  • 2 likes
  • 6 in conversation