Solved: Add seq year and add missing values

AnandsinhSindha · Posted 09-01-2020 04:05 AM

Need help in this data step

i have data sample...

Every year two bill generate for each id,and if year is missing i need to add year with missing amount.

data have

ID Year Amount

1 2002 100

1 2004 200

1 2004 250

1 2010 300

1 2010 320

1 2011 400

: : :

1 2019 460

Above is my sample dataset and i want to generate below output.

data output

ID YEAR Amount

1 2002 100

1 2002 .

1 2003 .

1 2004 250

1 2004 .

1 2005 .

1 2006 .

1 2007 .

1 2008 .

1 2009 .

1 2010 300

1 2010 320

1 2011 400

: : :

1 2019 460

i am using sas enterprise guide.8.1

can someone suggest me a way how can i do this.

thank you.

Anandsinh

RichardDeVen · Posted 09-01-2020 07:32 AM

A judicious use use dual SET statements and LAG function can be the basis of a short DATA Step program.

Edit: The expression NOT FIRST.ID is used to prevent gap looping between ID's (Edit replaced _N_>1).

Example:

data have; input
ID Year Amount; datalines;
1 2002 100
1 2004 200
1 2004 250
1 2010 300
1 2010 320
1 2011 400
1 2011 400
1 2019 460
1 2019 460
;

data want;
  set have;

  /* loop over gap years filling in two observations each with missing amounts */

  do year = sum(lag(year),1) to year-1 while (NOT FIRST.ID);
    call missing(amount);
    OUTPUT;
    OUTPUT;
  end;

  /* reread data in sync with first SET in order to
   * re-retrieve AMOUNT that may have been overwritten within gap loop
   */

  set have;
  by id year;

  OUTPUT;

  if first.year and last.year then do; /* check if id/year has only one row */
    call missing(amount);
    OUTPUT;                            /* output missing AMOUNT for second row of id/year */
  end;
run;

An alternative to dual SET would use temporary variables (hold_*) to track values that would get munged by the gap loop.

data want(drop=hold:);
  set have;
  by id year;

  hold_year = year;
  hold_amount = amount;
  call missing (amount);

  do year = sum(lag(year),1) to year-1 while (NOT FIRST.ID);
    OUTPUT;
    OUTPUT;
  end;

  year = hold_year;
  amount = hold_amount;

  OUTPUT;
  if first.year and last.year then do;
    call missing(amount);
    OUTPUT;
  end;
run;

View solution in original post

PeterClemmensen · Posted 09-01-2020 04:23 AM

Why is Amount=. for the second obs of Year=2004?

The DATA to DATA Step Macro
Blog: SASnrd

AnandsinhSindha · Posted 09-01-2020 04:36 AM

2 bill generated for each year...

if bill not generated two times per year, i need to generate with missing amount.

that's why amount =. in second obs of 2004 in output dataset

PeterClemmensen · Posted 09-01-2020 04:37 AM

But there are two obs in the input data set with year=2004 and Amount=200 and 250 respectively?

The DATA to DATA Step Macro
Blog: SASnrd

AnandsinhSindha · Posted 09-01-2020 04:40 AM

Sorry its my mistake in output data set...

AnandsinhSindha · Posted 09-01-2020 04:43 AM

The output dataset should be like this....

data output

ID YEAR Amount

1 2002 100

1 2002 .

1 2003 .

1 2004 200

1 2004 250

1 2005 .

1 2006 .

1 2007 .

1 2008 .

1 2009 .

1 2010 300

1 2010 320

1 2011 400

: : :

1 2019 460

PeterClemmensen · Posted 09-01-2020 05:25 AM

Try this. I added another ID to the example data with another Year Range. I assume you want to be able to handle multiple by-groups.

data have;
input ID Year Amount;
datalines;
1 2002 100
1 2004 200
1 2004 250
1 2010 300
1 2010 320
1 2011 400
1 2011 400
1 2019 460
1 2019 460
2 2001 100
2 2004 200
2 2004 250
2 2010 300
2 2010 320
2 2011 400
2 2011 400
2 2018 460
2 2018 460
;

data want (drop=i s e);
   if _N_ = 1 then do;
      dcl hash h(multidata : "Y", ordered : "Y");
      h.definekey("Year");
      h.definedata("Year", "Amount");
      h.definedone();
      dcl hiter hi("h");
   end;

   do until (last.ID);
      set have;
      by ID;
      h.add();
   end;

   _N_ = hi.first(); s = Year;
   _N_ = hi.last();  e = Year;
   _N_ = hi.next();

   do Year = s to e;
      do i = 0 by 1 while (h.do_over() = 0);
         output;
      end;
      if i = 0 then do;
         Amount = .;
         output; output;
      end;
      if i = 1 then do;
         Amount = .;
         output;
      end;
   end;

   _N_ = h.clear();

run;

The DATA to DATA Step Macro
Blog: SASnrd

DaveShea · Posted 09-01-2020 05:34 AM

Hi Anandsinh,

Try this idea. It has several steps, but it has the advantage of slowing down what is going on, so that you can understand how it works and decide whether it suits your needs. It also has lot of comments and so appears bigger than it really is.

I'm going to pretend for a moment that you really do want every row to have ID=1 🤔

**************************************************************;
*Create the original dataset supplied by Anandsinh            ;
**************************************************************;
Data HAVE;
 Input  ID
        Year
        Amount
        ;
DATALINES;
1 2002 100
1 2004 200
1 2004 250
1 2010 300
1 2010 320
1 2011 400
1 2011 400
1 2019 460
1 2019 460
;
Run;

**************************************************************;
*Sort the dataset so that we can assign the Sequence number   ;
*we need, temporarily.                                        ;
**************************************************************;
Proc Sort Data=HAVE;
 By Year;
Run;

*******************************************************************;
*If we are on the first row for a given Year, assign Sequence=1    ;
*If we are on the last row for a given year and it is also NOT the ;
*first row for that year, assign Sequence=2                        ;
*******************************************************************;
Data HAVE;
 Set HAVE;
 By Year;

 If First.Year Then Sequence=1;
 If Last.Year  And NOT First.Year Then Sequence=2;

Run;

**************************************************************;
*Figure out what range of years we currently have.            ;
**************************************************************;
Proc SQL NoPrint;
 Create Table HAVE_YEARS As
 Select Min(Year) As Min_Year,
        Max(Year) As Max_Year
 From   HAVE
;
Quit;

**************************************************************;
*Create an empty dataset of just two rows for each year in    ;
*our range.                                                   ;
**************************************************************;
Data ALL_YEARS;
 Set HAVE_YEARS;

 Do i=Min_Year To Max_Year;
    ID=1;
    Year=i;
    Amount=.;
    **************************************************************;
    *Output each row TWICE, each time with a different Sequence.  ;
    **************************************************************;
    Sequence=1;
    Output;
    Sequence=2;
    Output;
 End;

Keep    ID
        Year
        Amount
        Sequence
        ;
Run;

**************************************************************;
*Sort the datasets prior to merging them together.            ;
**************************************************************;
Proc Sort Data=HAVE;
 By Year
    Sequence
    ;
Run;

Proc Sort Data=ALL_YEARS;
 By Year
    Sequence
    ;
Run;

**************************************************************;
*Merge the original data with the empty frame of the ALL_YEARS;
*dataset.                                                     ;
**************************************************************;
Data WANT;
 Merge ALL_YEARS
       HAVE
       ;
       
 By Year
    Sequence
    ;

**************************************************************;
*Optionally DROP the Sequence variable.                       ;
**************************************************************;
Drop Sequence
     ;

Run;

Good luck, and ask if anything is not clear.

Downunder Dave
Wellington
New Zealand

andreas_lds · Posted 09-01-2020 06:51 AM

And here is another solution using data step and proc sort. Maybe i missed some crucial information and the code is to simple:

data want_simple;
   set have;
   by Id Year;

   retain expectedYear;

   if first.Id then do;
      expectedYear = year;
   end;

   output;

   if first.Year and last.Year then do;
      amount = .;
      output;
   end;

   if last.Year then do;
      if expectedYear < year then do;
         do year = expectedYear to year-1;
            amount = .;
            output;
            output;
         end;
      end;
      expectedYear = year +1;
   end;

   drop expectedYear;
run;

proc sort data=work.want_simple;
   by id Year;
run;

RichardDeVen · Posted 09-01-2020 07:32 AM

A judicious use use dual SET statements and LAG function can be the basis of a short DATA Step program.

Edit: The expression NOT FIRST.ID is used to prevent gap looping between ID's (Edit replaced _N_>1).

Example:

data have; input
ID Year Amount; datalines;
1 2002 100
1 2004 200
1 2004 250
1 2010 300
1 2010 320
1 2011 400
1 2011 400
1 2019 460
1 2019 460
;

data want;
  set have;

  /* loop over gap years filling in two observations each with missing amounts */

  do year = sum(lag(year),1) to year-1 while (NOT FIRST.ID);
    call missing(amount);
    OUTPUT;
    OUTPUT;
  end;

  /* reread data in sync with first SET in order to
   * re-retrieve AMOUNT that may have been overwritten within gap loop
   */

  set have;
  by id year;

  OUTPUT;

  if first.year and last.year then do; /* check if id/year has only one row */
    call missing(amount);
    OUTPUT;                            /* output missing AMOUNT for second row of id/year */
  end;
run;

An alternative to dual SET would use temporary variables (hold_*) to track values that would get munged by the gap loop.

data want(drop=hold:);
  set have;
  by id year;

  hold_year = year;
  hold_amount = amount;
  call missing (amount);

  do year = sum(lag(year),1) to year-1 while (NOT FIRST.ID);
    OUTPUT;
    OUTPUT;
  end;

  year = hold_year;
  amount = hold_amount;

  OUTPUT;
  if first.year and last.year then do;
    call missing(amount);
    OUTPUT;
  end;
run;

Ksharp · Posted 09-01-2020 07:32 AM

data have;
input ID Year Amount;
datalines;
1 2002 100
1 2004 200
1 2004 250
1 2010 300
1 2010 320
1 2011 400
1 2011 400
1 2019 460
1 2019 460
2 2001 100
2 2004 200
2 2004 250
2 2010 300
2 2010 320
2 2011 400
2 2011 400
2 2018 460
2 2018 460
;
proc sql;
create table temp as
select id,min(year) as min,max(year) as max from have group by id;
quit;
data temp1;
 set temp;
 do year=min to max;
   output;output;
 end;
keep id year;
run;
data want;
 merge temp1 have;
 by id year;
 output;
 call missing(amount);
run;

AnandsinhSindha · Posted 09-02-2020 01:10 AM

Thank you so much for your quick response

Add seq year and add missing values

Re: Add seq year and add missing values

Re: Add seq year and add missing values

Re: Add seq year and add missing values

Re: Add seq year and add missing values

Re: Add seq year and add missing values

Re: Add seq year and add missing values

Re: Add seq year and add missing values

Re: Add seq year and add missing values

Re: Add seq year and add missing values

Re: Add seq year and add missing values

Re: Add seq year and add missing values

Re: Add seq year and add missing values

Classroom Training Available!