Solved: Re: SAS programming: add missing rows to a long file

hein68 · Posted 11-06-2024 10:58 AM

I have a long file with data like this:

Some projids have one or more of the 13 rows missing, but it's random as to which row(s) are missing. What is the easiest way to write code to add the rows that are missing for each projid?

Thanks!!

mkeintz · Posted 11-06-2024 02:16 PM

This program has two data steps, but only demands the original data from storage be retrieved once, because it uses a data set view.

proc format;
  value timept 0="Baseline" 1="1 year" 2="2 year" 3="3 year"
		4="4 year" 5="5 year" 6="6 year" 7="7 year" 8="8 year"
		9="9 year" 10="10 year" 11="11 year" 12="12 year" 13="13 year"
		14="14 year" 15="15 year" 16="16 year" 17="17 year"
		18="18 year" 19="19 year";

data years;
input id time var1;
format time timept. ;
cards;
12345  0   0.089
12345  1   0.099
12345  3   0.0762
12345  5   0.103
12345  6   0.089
12345  7   0.089
12345  8   0.089
12345  9   0.089
12345  10   0.089
12345  11   0.089
run;

data all_time_values/view=all_time_values ;
  set years (keep=id);
  by id;
  if first.id;
  do time=0 to 13;  /* Baseline to Year 13 */
    output;
  end;
run;


data want ;
  merge all_time_values years;
  by id time;
run;

This assumes the data are sorted by ID and time.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

andreas_lds · Posted 11-06-2024 11:19 AM

How many obs should each projid have? Is the number constant?

What is the real contents of "time"? Can you show the unformatted values?

Can you please post data in usable form, so that we have something to work with?

Usable form = data step with datalines.

PaigeMiller · Posted 11-06-2024 11:54 AM

Here is my code that works on fake data. This could be extremely slow on large data sets. But really, we need answers to all the questions from @andreas_lds

data years;
input id time var1;
cards;
12345  1   0.089
12345  2   0.099
12345  3   0.0762
12345  5   0.103
12345  6   0.089
12345  7   0.089
12345  8   0.089
12345  9   0.089
12345 10   0.089
12345 11   0.089
;
data thirteen;
    do time=1 to 13;
        output;
    end;
run;

proc sql;
    create table ids as select distinct id from years;
    create table all_years as select distinct a.id,b.time from years as a,thirteen as b; /* Cartesian join */
    create table want as select a.id,a.time,b.var1 from all_years as a left join years as b
        on a.id=b.id and a.time=b.time;
quit;

--
Paige Miller

hein68 · Posted 11-06-2024 12:39 PM

Each projid should have 13 observations/rows. In the example I show in the screenshot, the original data source does not include a row for observation 10 (for 10 years). I need to get a blank row added that would be for observation 10 (10 years).

In another example, the projid had 2 observations (baseline and 1 year), so in this case I need to add the blank rows for years 3 through 13.

This code shows how the time variable is formatted.

value timept 0="Baseline" 1="1 year" 2="2 year" 3="3 year"
		4="4 year" 5="5 year" 6="6 year" 7="7 year" 8="8 year"
		9="9 year" 10="10 year" 11="11 year" 12="12 year" 13="13 year"
		14="14 year" 15="15 year" 16="16 year" 17="17 year"
		18="18 year" 19="19 year";

Thanks!

data_null__ · Posted 11-06-2024 12:19 PM

This will fill missing rows at the end and in the middle.

data have;
   do projid='28533', '29344';
      do year = 0 to 3, 5;
         y = ranuni(1234);
         output;
         end;
      end;
   run;

proc format;
   value xyear(notsorted)
      0 = 'Baseline'
      1 = '1 Year'
      2 = '2 Year'
      3 = '3 Year'
      4 = '4 Year'
      5 = '5 Year'
      6 = '6 Year'
      7 = '7 Year'
      ;
   quit;
proc summary data=have nway missing completetypes;
   by projid;
   class year / preloadfmt;
   format year xyear.;
   output out=expanded(drop=_type_);
   run;
data want;
   merge expanded have;
   by projid year;
   run;
proc print;
   run;

mkeintz · Posted 11-06-2024 02:16 PM

This program has two data steps, but only demands the original data from storage be retrieved once, because it uses a data set view.

proc format;
  value timept 0="Baseline" 1="1 year" 2="2 year" 3="3 year"
		4="4 year" 5="5 year" 6="6 year" 7="7 year" 8="8 year"
		9="9 year" 10="10 year" 11="11 year" 12="12 year" 13="13 year"
		14="14 year" 15="15 year" 16="16 year" 17="17 year"
		18="18 year" 19="19 year";

data years;
input id time var1;
format time timept. ;
cards;
12345  0   0.089
12345  1   0.099
12345  3   0.0762
12345  5   0.103
12345  6   0.089
12345  7   0.089
12345  8   0.089
12345  9   0.089
12345  10   0.089
12345  11   0.089
run;

data all_time_values/view=all_time_values ;
  set years (keep=id);
  by id;
  if first.id;
  do time=0 to 13;  /* Baseline to Year 13 */
    output;
  end;
run;


data want ;
  merge all_time_values years;
  by id time;
run;

This assumes the data are sorted by ID and time.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

hein68 · Posted 11-06-2024 04:01 PM

This worked beautifully! Thank you!

Registration is open

SAS Training: Just a Click Away