I currently have two datasets. In each dataset, observations are organized by two variables: date (in the form of day/month) and a 3-digit alphanumeric code. For each date/code combo there is a count (usually either 0 or 1). There are 7 days in the range (say Jul. 1-7). One dataset has observations from 2014, and the other dataset has a combined average observation from 2012 and 2013, but I have removed the year from the date so that I will be able to merge the datasets. Some codes have no counts for a given day in one or the other of the datasets because that code simply wasn't used on a given day; and some of the possible codes (of which there are 516, contained in a finite list in another spreadsheet) were not used in one or the other of the datasets at all. When a code from the master list wasn't originally used in that 7-day period, I used proc sql to insert those missing codes into the dataset (thanks to help provided in a previous thread here). For such codes, there is only one observation with a missing (.) date and a count of 0. What I want to do is: A) combine the two datasets, by code and date; and B) somehow fill in missing days so that each code has an observation for each of the 7 days (Jul. 1-7), with the count being 0 if there was originally no observation.
An example of the current structure of Dataset 1 is:
Code Date Observed Count (2014)
1X1 07/01 1
1X1 07/03 1
1X1 07/06 0
2Y3 . 0
3J8 07/02 1
3J8 07/03 0
An example of the current structure of Dataset 2 is:
Code Date Observed Count (2013/2012 Avg)
1X1 07/01 0
1X1 07/02 1
1X1 07/04 0
2Y3 07/01 1
3J8 . 0
What I would like the final dataset to look like is:
Code Date Observed Count (2014) Observed Count (2013/2012 Avg)
1X1 07/01 1 0
1X1 07/02 0 1
1X1 07/03 1 0
1X1 07/04 0 0
1X1 07/05 0 0
1X1 07/06 0 0
1X1 07/07 0 0
2Y3 07/01 0 1
2Y3 07/02 0 0
2Y3 07/03 0 0
2Y3 07/04 0 0
2Y3 07/05 0 0
2Y3 07/06 0 0
2Y3 07/07 0 0
3J8 07/01 0 0
3J8 07/02 1 0
3J8 07/03 0 0
3J8 07/04 0 0
3J8 07/05 0 0
3J8 07/06 0 0
3J8 07/07 0 0
I am not sure how to go about combining and filling in gaps in the datasets in this manner. I did find a somewhat similar example using PROC EXPAND online, but wasn't sure how to adapt it to this scenario. Any guidance on how to achieve this task would be sincerely appreciated. Thank you!
When ARRAY can be used, HASH Object too can be used.
Here is Hash version.
BY GROUP processing with CLEAR() would be far easier as done with the ARRAY. No need to have BY GROUP processing. No need to change the date to number. Both Data Sets are read sequentially and the hash table is built.
data one;
input code $3. @9 date $5. count;
datalines;
1X1 07/01 1
1X1 07/03 1
1X1 07/06 0
2Y3 . 0
3J8 07/02 1
3J8 07/03 0
;
run;
data two;
input code $3. @9 date $5. count;
datalines;
1X1 07/01 0
1X1 07/02 1
1X1 07/04 0
2Y3 07/01 1
3J8 . 0
;
run;
data want;
if _n_ = 1 then do;
if 0 then set one;
declare hash h(ordered:'a');
h.definekey('code','date');
h.definedata('code', 'date', 'count_14', 'count_13');
h.definedone();
declare hash h1(ordered:'y');
h1.definekey('code');
h1.definedone();
declare hiter hi('h1');
end;
do until(last);
set one(in = a) two(in = b) end = last;
if h.find() ^= 0 then do; count_14 = 0; count_13 = 0; end;
if a then count_14 = count;
if b then count_13 = count;
if not missing(date) then h.replace();
if h1.find() ^= 0 then h1.add();
end;
do while(hi.next() = 0);
do Date = '07/01','07/02','07/03','07/04','07/05','07/06','07/07';
rc = h.find();
if rc ^= 0 then do;
count_14 = 0;
count_13 = 0;
end;
output;
end;
end;
drop rc count;
run;
No need for ETS. Simple data step is sufficient to get an elegant solution.
Change Date into a Number by using the MDY() function. Use one array for 2014(say, k_14[ ]) and anther array for average of 2012 & 13 (say, k_13[ ]). In double DoW-loops, read observations from the two data sets by BY GROUP (by CODE). Then merge the two arrays into one and output the results.
data one(drop=date dd mm);
input code $3. @9 date $5. count;
mm = substr(date,1,2);
dd = substr(date, 4);
ndate = mdy(mm,dd, 2014);
datalines;
1X1 07/01 1
1X1 07/03 1
1X1 07/06 0
2Y3 . 0
3J8 07/02 1
3J8 07/03 0
;
run;
data two(drop=date dd mm);
input code $3. @9 date $5. count;
mm = substr(date,1,2);
dd = substr(date, 4);
ndate = mdy(mm,dd, 2014);
datalines;
1X1 07/01 0
1X1 07/02 1
1X1 07/04 0
2Y3 07/01 1
3J8 . 0
;
run;
data want(rename = (i = Date));
array k_14[19905:19911] _temporary_;
array k_13[19905:19911] _temporary_;
do until(last.code);
set one;
by code;
if not missing(ndate) then k_14[ndate] = count;
end;
do until(last.code);
set two;
by code;
if not missing(ndate) then k_13[ndate] = count;
end;
do i = lbound(k_14) to hbound(k_14);
count_14 = ifN(k_14 > 0, k_14, 0);
count_13 = ifN(k_13 > 0, k_13, 0);
output;
end;
call missing(of k_14
keep code i count_14 count_13;
format i date5.;
run;
data one; input code $3. @9 date $5. count; datalines; 1X1 07/01 1 1X1 07/03 1 1X1 07/06 0 2Y3 . 0 3J8 07/02 1 3J8 07/03 0 ; run; data two; input code $3. @9 date $5. count; datalines; 1X1 07/01 0 1X1 07/02 1 1X1 07/04 0 2Y3 07/01 1 3J8 . 0 ; run; data date; do i='01jul2014'd to '07jul2014'd; date=put(i,mmddyy5.);output; end; drop i; run; proc sql; create table want as select x.*,count_2013_2012_avg from ( select a.*,case when missing(b.count) then 0 else b.count end as count_2014 from (select * from (select distinct code from one),date) as a left join one as b on a.code=b.code and a.date=b.date ) as x , ( select a1.*,case when missing(b1.count) then 0 else b1.count end as count_2013_2012_avg from (select * from (select distinct code from two),date) as a1 left join two as b1 on a1.code=b1.code and a1.date=b1.date ) as y where x.code=y.code and x.date=y.date order by 1,2; quit;
Xia Keshan
When ARRAY can be used, HASH Object too can be used.
Here is Hash version.
BY GROUP processing with CLEAR() would be far easier as done with the ARRAY. No need to have BY GROUP processing. No need to change the date to number. Both Data Sets are read sequentially and the hash table is built.
data one;
input code $3. @9 date $5. count;
datalines;
1X1 07/01 1
1X1 07/03 1
1X1 07/06 0
2Y3 . 0
3J8 07/02 1
3J8 07/03 0
;
run;
data two;
input code $3. @9 date $5. count;
datalines;
1X1 07/01 0
1X1 07/02 1
1X1 07/04 0
2Y3 07/01 1
3J8 . 0
;
run;
data want;
if _n_ = 1 then do;
if 0 then set one;
declare hash h(ordered:'a');
h.definekey('code','date');
h.definedata('code', 'date', 'count_14', 'count_13');
h.definedone();
declare hash h1(ordered:'y');
h1.definekey('code');
h1.definedone();
declare hiter hi('h1');
end;
do until(last);
set one(in = a) two(in = b) end = last;
if h.find() ^= 0 then do; count_14 = 0; count_13 = 0; end;
if a then count_14 = count;
if b then count_13 = count;
if not missing(date) then h.replace();
if h1.find() ^= 0 then h1.add();
end;
do while(hi.next() = 0);
do Date = '07/01','07/02','07/03','07/04','07/05','07/06','07/07';
rc = h.find();
if rc ^= 0 then do;
count_14 = 0;
count_13 = 0;
end;
output;
end;
end;
drop rc count;
run;
Hello,
You can first fill-in the missing series and afterwards concatenate the datasets:
data dataset1 (drop=date);
input Code $ Date $ Count_2014;
if missing(Date)=0 then
date_num=mdy(input(substr(date,1,2),3.),input(substr(date,4,2),3.),2014);
format date_num date7.;
datalines;
1X1 07/01 1
1X1 07/03 1
1X1 07/06 0
2Y3 . 0
3J8 07/02 1
3J8 07/03 0
4J8 07/02 1
4J8 07/07 0
;
run;
data dataset2 (drop=date);
input Code $ Date $ Count_2012_2013;
if missing(Date)=0 then
date_num=mdy(input(substr(date,1,2),3.),input(substr(date,4,2),3.),2014);
format date_num date7.;
datalines;
1X1 07/01 0
1X1 07/02 1
1X1 07/04 0
2Y3 07/01 1
3J8 . 0
4J8 07/07 0
;
run;
data have1;
format date date7.;
set dataset1;
by Code;
retain date;
if first.Code then date="01JUL2014"d;
do i=date to "07JUL2014"d while (date_num ne date);
Observed_Count_2014=0;
output;
date+1;
;
end;
if date_num=date then do;Observed_Count_2014=count_2014;output;date+1;end;
if last.Code and date_num not in(.,"07JUL2014"d) and date le "07JUL2014"d then
do;
do until(date="08JUL2014"d) ;
Observed_Count_2014=0;
output;
date+1;
end;
end;
keep code date Observed_Count_2014;
run;
data have2;
format date date7.;
set dataset2;
by Code;
retain date;
if first.Code then date="01JUL2014"d;
do i=date to "07JUL2014"d while (date_num ne date);
Observed_Count_2012_2013=0;
output;
date+1;
;
end;
if date_num=date then do;Observed_Count_2012_2013=count_2012_2013;output;date+1;end;
if last.Code and date_num not in(.,"07JUL2014"d) and date le "07JUL2014"d then
do;
do until(date="08JUL2014"d) ;
Observed_Count_2012_2013=0;
output;
date+1;
end;
end;
keep code date Observed_Count_2012_2013;
run;
proc sql;
create table want as
select h1.code,h1.date,Observed_Count_2014,Observed_Count_2012_2013 from have1 as h1 inner join have2 as h2
on h1.code=h2.code and h1.date=h2.date;
quit;
Thank you Reeza, datasp, Xia, and Loko! I truly appreciate all your very helpful answers. I ended up using the second solution proposed by datasp, but I'm sure any of these would work well. It's great to see the different ways this can be achieved. Thanks again for the help!
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.