DATA Step, Macro, Functions and more

Combining and filling in gaps in datasets

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 9
Accepted Solution

Combining and filling in gaps in datasets

I currently have two datasets. In each dataset, observations are organized by two variables: date (in the form of day/month) and a 3-digit alphanumeric code. For each date/code combo there is a count (usually either 0 or 1). There are 7 days in the range (say Jul. 1-7). One dataset has observations from 2014, and the other dataset has a combined average observation from 2012 and 2013, but I have removed the year from the date so that I will be able to merge the datasets. Some codes have no counts for a given day in one or the other of the datasets because that code simply wasn't used on a given day; and some of the possible codes (of which there are 516, contained in a finite list in another spreadsheet) were not used in one or the other of the datasets at all. When a code from the master list wasn't originally used in that 7-day period, I used proc sql to insert those missing codes into the dataset (thanks to help provided in a previous thread here). For such codes, there is only one observation with a missing (.) date and a count of 0. What I want to do is: A) combine the two datasets, by code and date; and B) somehow fill in missing days so that each code has an observation for each of the 7 days (Jul. 1-7), with the count being 0 if there was originally no observation.

An example of the current structure of Dataset 1 is:

Code   Date       Observed Count (2014)

1X1     07/01         1

1X1     07/03         1

1X1     07/06         0

2Y3     .                 0

3J8     07/02         1

3J8     07/03         0

An example of the current structure of Dataset 2 is:

Code   Date       Observed Count (2013/2012 Avg)

1X1     07/01         0

1X1     07/02         1

1X1     07/04         0

2Y3     07/01         1

3J8     .                  0

What I would like the final dataset to look like is:

Code   Date       Observed Count (2014)    Observed Count (2013/2012 Avg)

1X1     07/01         1                                      0

1X1     07/02         0                                      1

1X1     07/03         1                                      0

1X1     07/04         0                                      0

1X1     07/05         0                                      0

1X1     07/06         0                                      0

1X1     07/07         0                                      0

2Y3     07/01         0                                      1

2Y3     07/02         0                                      0

2Y3    07/03          0                                      0

2Y3     07/04         0                                      0

2Y3    07/05         0                                       0

2Y3     07/06         0                                      0

2Y3     07/07         0                                      0

3J8     07/01          0                                      0

3J8     07/02         1                                       0

3J8     07/03         0                                       0

3J8     07/04         0                                       0

3J8     07/05         0                                       0

3J8     07/06         0                                       0

3J8     07/07         0                                       0

I am not sure how to go about combining and filling in gaps in the datasets in this manner. I did find a somewhat similar example using PROC EXPAND online, but wasn't sure how to adapt it to this scenario. Any guidance on how to achieve this task would be sincerely appreciated. Thank you!


Accepted Solutions
Solution
‎01-26-2015 06:05 AM
Super Contributor
Posts: 255

Re: Combining and filling in gaps in datasets

When ARRAY can be used, HASH Object too can be used.

Here is Hash version.

BY GROUP processing with CLEAR() would be far easier as done with the ARRAY. No need to have BY GROUP processing. No need to change the date to number. Both Data Sets are read sequentially and the hash table is built.

data one;

input code $3. @9 date $5. count;

datalines;

1X1     07/01         1

1X1     07/03         1

1X1     07/06         0

2Y3     .             0

3J8     07/02         1

3J8     07/03         0

;

run;

data two;

input code $3. @9 date $5. count;

datalines;

1X1     07/01         0

1X1     07/02         1

1X1     07/04         0

2Y3     07/01         1

3J8     .             0

;

run;

data want;

   if _n_ = 1 then do;

      if 0 then set one;

      declare hash h(ordered:'a');

      h.definekey('code','date');

      h.definedata('code', 'date', 'count_14', 'count_13');

      h.definedone();

      declare hash h1(ordered:'y');

      h1.definekey('code');

      h1.definedone();

      declare hiter hi('h1');

   end;

   do until(last);

      set one(in = a) two(in = b) end = last;

    

      if h.find() ^= 0 then do; count_14 = 0;  count_13 = 0; end;

      if a then count_14 = count;

      if b then count_13 = count;

      if not missing(date) then h.replace();

     

      if h1.find() ^= 0 then h1.add();

   end;

   do while(hi.next() = 0);

      do Date = '07/01','07/02','07/03','07/04','07/05','07/06','07/07';

         rc = h.find();

         if rc ^= 0 then do;

            count_14 = 0;

            count_13 = 0;

         end;

         output;

      end;

   end;

drop rc count;     

run;

View solution in original post


All Replies
Super User
Posts: 17,842

Re: Combining and filling in gaps in datasets

This isn't the usual way I'd recommend but its a way, if you have SAS ETS licensed.

How I usually recommend doing it, no SAS ETS required:

Super Contributor
Posts: 255

Re: Combining and filling in gaps in datasets

No need for ETS. Simple data step is sufficient to get an elegant solution.

Change Date into  a Number by using the MDY() function. Use one array for 2014(say, k_14[ ])  and anther array for average of 2012 & 13 (say, k_13[ ]). In double DoW-loops, read observations from the two data sets by BY GROUP (by CODE). Then merge the two arrays into one and output the results.

data one(drop=date dd mm);

input code $3. @9 date $5. count;

mm = substr(date,1,2);

dd = substr(date, 4);

ndate = mdy(mm,dd, 2014);

datalines;

1X1     07/01         1

1X1     07/03         1

1X1     07/06         0

2Y3     .             0

3J8     07/02         1

3J8     07/03         0

;

run;

data two(drop=date dd mm);

input code $3. @9 date $5. count;

mm = substr(date,1,2);

dd = substr(date, 4);

ndate = mdy(mm,dd, 2014);

datalines;

1X1     07/01         0

1X1     07/02         1

1X1     07/04         0

2Y3     07/01         1

3J8     .             0

;

run;

data want(rename = (i = Date));

array k_14[19905:19911] _temporary_;

array k_13[19905:19911] _temporary_;

   do until(last.code);

      set one;

      by code;

      if not missing(ndate) then k_14[ndate] = count;

   end;

   do until(last.code);

      set two;

      by code;

      if not missing(ndate) then k_13[ndate] = count;

   end;

   do i = lbound(k_14) to hbound(k_14);

      count_14 = ifN(k_14 > 0, k_14, 0);

      count_13 = ifN(k_13 > 0, k_13, 0);

      output;

   end;

   call missing(of k_14

  • , of k_13
  • );
  • keep code i count_14 count_13;

    format i date5.;

    run;

    Super User
    Posts: 9,682

    Re: Combining and filling in gaps in datasets

    data one;
    input code $3. @9 date $5. count;
    datalines;
    1X1     07/01         1
    1X1     07/03         1
    1X1     07/06         0
    2Y3     .             0
    3J8     07/02         1
    3J8     07/03         0
    ;
    run;
     
    data two;
    input code $3. @9 date $5. count;
    datalines;
    1X1     07/01         0
    1X1     07/02         1
    1X1     07/04         0
    2Y3     07/01         1
    3J8     .             0
    ;
    run;
    data date;
     do i='01jul2014'd to '07jul2014'd;
      date=put(i,mmddyy5.);output;
     end;
     drop i;
    run;
    proc sql;
    create table want as
    select x.*,count_2013_2012_avg from
    (
    select a.*,case when missing(b.count) then 0 else b.count end as count_2014  from
     (select * from (select distinct code from one),date) as a left join one as b
     on a.code=b.code and a.date=b.date
    ) as x
    ,
    (
    select a1.*,case when missing(b1.count) then 0 else b1.count end as count_2013_2012_avg  from
     (select * from (select distinct code from two),date) as a1 left join two as b1
     on a1.code=b1.code and a1.date=b1.date
    ) as y
    where x.code=y.code and x.date=y.date
    order by 1,2;
    
    
    quit;
    
    

    Xia Keshan

    Solution
    ‎01-26-2015 06:05 AM
    Super Contributor
    Posts: 255

    Re: Combining and filling in gaps in datasets

    When ARRAY can be used, HASH Object too can be used.

    Here is Hash version.

    BY GROUP processing with CLEAR() would be far easier as done with the ARRAY. No need to have BY GROUP processing. No need to change the date to number. Both Data Sets are read sequentially and the hash table is built.

    data one;

    input code $3. @9 date $5. count;

    datalines;

    1X1     07/01         1

    1X1     07/03         1

    1X1     07/06         0

    2Y3     .             0

    3J8     07/02         1

    3J8     07/03         0

    ;

    run;

    data two;

    input code $3. @9 date $5. count;

    datalines;

    1X1     07/01         0

    1X1     07/02         1

    1X1     07/04         0

    2Y3     07/01         1

    3J8     .             0

    ;

    run;

    data want;

       if _n_ = 1 then do;

          if 0 then set one;

          declare hash h(ordered:'a');

          h.definekey('code','date');

          h.definedata('code', 'date', 'count_14', 'count_13');

          h.definedone();

          declare hash h1(ordered:'y');

          h1.definekey('code');

          h1.definedone();

          declare hiter hi('h1');

       end;

       do until(last);

          set one(in = a) two(in = b) end = last;

        

          if h.find() ^= 0 then do; count_14 = 0;  count_13 = 0; end;

          if a then count_14 = count;

          if b then count_13 = count;

          if not missing(date) then h.replace();

         

          if h1.find() ^= 0 then h1.add();

       end;

       do while(hi.next() = 0);

          do Date = '07/01','07/02','07/03','07/04','07/05','07/06','07/07';

             rc = h.find();

             if rc ^= 0 then do;

                count_14 = 0;

                count_13 = 0;

             end;

             output;

          end;

       end;

    drop rc count;     

    run;

    Super Contributor
    Posts: 305

    Re: Combining and filling in gaps in datasets

    Hello,

    You can first fill-in the missing series and afterwards concatenate the datasets:

    data dataset1 (drop=date);
    input Code $ Date $ Count_2014;
    if missing(Date)=0 then
    date_num=mdy(input(substr(date,1,2),3.),input(substr(date,4,2),3.),2014);

    format date_num date7.;
    datalines;
    1X1 07/01 1
    1X1 07/03 1
    1X1 07/06 0
    2Y3 . 0
    3J8 07/02 1
    3J8 07/03 0
    4J8 07/02 1
    4J8 07/07 0
    ;
    run;
    data dataset2 (drop=date);
    input Code $ Date $ Count_2012_2013;
    if missing(Date)=0 then
    date_num=mdy(input(substr(date,1,2),3.),input(substr(date,4,2),3.),2014);

    format date_num date7.;
    datalines;
    1X1 07/01 0
    1X1 07/02 1
    1X1 07/04 0
    2Y3 07/01 1
    3J8 . 0
    4J8 07/07 0
    ;
    run;

    data have1;

    format date date7.;
    set dataset1;
    by Code;

    retain date;

    if first.Code then date="01JUL2014"d;

    do i=date to "07JUL2014"d while (date_num ne date);
    Observed_Count_2014=0;
    output;
    date+1;
    ;
    end;

    if date_num=date then do;Observed_Count_2014=count_2014;output;date+1;end;


    if last.Code and date_num not in(.,"07JUL2014"d)  and date le "07JUL2014"d then
    do;
    do until(date="08JUL2014"d) ;
      Observed_Count_2014=0;
      output;
      date+1;
    end;
    end;

    keep code date Observed_Count_2014;
    run;

    data have2;

    format date date7.;
    set dataset2;
    by Code;

    retain date;

    if first.Code then date="01JUL2014"d;

    do i=date to "07JUL2014"d while (date_num ne date);
    Observed_Count_2012_2013=0;
    output;
    date+1;
    ;
    end;

    if date_num=date then do;Observed_Count_2012_2013=count_2012_2013;output;date+1;end;


    if last.Code and date_num not in(.,"07JUL2014"d)  and date le "07JUL2014"d then
    do;
    do until(date="08JUL2014"d) ;
      Observed_Count_2012_2013=0;
      output;
      date+1;
    end;
    end;

    keep code date Observed_Count_2012_2013;
    run;

    proc sql;
    create table want as
    select h1.code,h1.date,Observed_Count_2014,Observed_Count_2012_2013 from have1 as h1 inner join have2 as h2
    on h1.code=h2.code and h1.date=h2.date;
    quit;

    Occasional Contributor
    Posts: 9

    Re: Combining and filling in gaps in datasets

    Thank you Reeza, datasp, Xia, and Loko! I truly appreciate all your very helpful answers. I ended up using the second solution proposed by datasp, but I'm sure any of these would work well. It's great to see the different ways this can be achieved. Thanks again for the help!

    ☑ This topic is SOLVED.

    Need further help from the community? Please ask a new question.

    Discussion stats
    • 6 replies
    • 743 views
    • 6 likes
    • 5 in conversation