- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have a dataset which is updated weekly in which there are rows missing depending on the week’s data. I’m trying to get each row to complete and produce a list of which states haven’t reported any data that week. There should be a line for each state for each week. Once I have a row for each of the 52 weeks for each year, I want to fill in the missing rainfall value with a 0. I’ve tried a number of array variations and haven’t been able to get this to work as the values aren’t exactly missing, it’s the whole row that doesn’t exist.
It would be a fairly simple complete step in R but I can’t get it to translate to SAS. The person I inherited this code from has been filling in the missing rows each week manually and I refuse to keep doing this for all 50ish ‘states’ (obviously this is a made up set).
For example right now I have,
State Year Week Rainfall
Illinois 2020 1 2
Illinois 2021 2 5
Illinois 2021 4 6
Missouri 2021 3 8
Missouri 2021 5 1
Missouri 2022 6 3
Missouri 2022 7 2
Wisconsin 2021 1 1
Wisconsin 2021 4 3
Wisconsin 2022 6 2
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Your problem description is not sufficient to give a robust answer. In particular, are you creating a new dataset each week from two data sources (the year-to-date data and the new single-week data)? This would make it easy to determine what the maximum week number should be for each state. On the other hand, if the data are already in a single data set, per your sample, then a bit more work is require to make a program that determination of the miximum valid week number.
I'm going to assume you have two datasets: YTD (the historical data up to now) and SINGLE_WEEK (the most recent week of rainfall data)
Then this should work. It's untested, because we need sample data in the form of working data steps:
data default_rainfall (drop=_:) / view=default_rainfall ;
if _n_=1 then set single_week (keep=week rename=(week=_most_recent_week));
/* Now read one obs per state */
set ytd (keep=state);
by state;
retain rainfall 0;
if first.state then do week=1 to _most_recent_week;
output;
end;
run;
data want;
merge default_rainfall ytd single_week;
by state week;
run;
This program first creates a complete dummy history called default_rainfall, including the most recent week, but with all rainfall values set to zero.
Then a simple SAS merge statement does everything you need. But note that default_rainfall must be the first names dataset in the merge statement. That's because merge will overwrite the rainfall values from default_rainfall by values from the other two datasets, whenever there is a match of state and week. It doesn't matter what order you put the other two datasets in, assuming they current history and the new single week never have the same week present.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks, that's helpful. Unfortunately it's a single dataset that's updated and imported each week. There's a macro to set the current and previous week numbers.
%let previous_wk=9;
%let current_wk=10;
Does this work for the data steps to reproduce?
data WORK.RAINFALL;
infile datalines dsd truncover;
input State:$9. Year:32. Week:32. Rainfall:32.;
label State="State" Year="Year" Week="Week" Rainfall="Rainfall";
datalines;
Illinois 2020 9 10
Illinois 2020 10 14
Illinois 2020 25 5
Illinois 2020 34 8
Illinois 2021 3 3
Illinois 2021 5 14
Illinois 2021 8 7
Illinois 2022 1 4
Missouri 2021 5 18
Missouri 2021 20 3
Missouri 2021 28 29
Missouri 2021 32 6
Missouri 2022 1 3
Missouri 2022 3 17
Missouri 2022 5 5
Wisconsin 2020 28 3
Wisconsin 2021 12 18
Wisconsin 2021 20 29
Wisconsin 2021 25 2
Wisconsin 2022 1 18
;;;;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here a SQL based approach. For this code to return your desired result you must have every single week for at least one state data.
data have;
infile datalines truncover;
input State $ Year Week Rainfall;
datalines;
Illinois 2020 1 2
Illinois 2021 2 5
Illinois 2021 4 6
Missouri 2021 3 8
Missouri 2021 5 1
Missouri 2022 6 3
Missouri 2022 7 2
Wisconsin 2021 1 1
Wisconsin 2021 4 3
Wisconsin 2022 6 2
;
proc sql;
create table distinct_state_week as
select
l.state,
r.week,
0 as rainfall
from
(select distinct state from have) l,
(select distinct week from have) r
;
quit;
proc sql;
create table want as
select
l.state,
l.week,
coalesce(r.rainfall, l.rainfall) as rainfall
from
distinct_state_week l
left join
have r
on l.state=r.state and l.week=r.week
;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
Just working with the one input data set, the following single data step should help.
data want;
set have;
by state year;
if first.year then
week_counter = 1;
if week_counter lt week then
do;
original_week = week;
original_rainfall = rainfall;
rainfall = 0;
do while (week_counter lt original_week);
week = week_counter;
output;
week_counter + 1;
end;
week = original_week;
rainfall = original_rainfall;
end;
output;
week_counter + 1;
if last.year and week_counter lt 52 then
do;
rainfall = 0;
do while (week_counter le 52);
week = week_counter;
output;
week_counter + 1;
end;
end;
run;
Kind regards,
Amir.
Edited to remove diagnostic statement from code.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
If I have understood your newer requirements, then the following should help.
I have updated and revised the code to make use of two macro variables, for earliest and latest week, which I have set to 2 & 5 respectively. This does mean that year 2020 for Illinois is not in the results with the solution presented, but It is not clear to me what you want the results to look like. Please provide the exact results you want for the data in your question, for an earliest and latest week of 2 & 5, plus any other examples you think are pertinent.
Lastly I have also commented out the assignment of rainfall to 0 and replaced it with a missing assignment via the use of call missing(), as this will help distinguish between rainfall data that was not received versus a rainfall value of 0 that was received.
If you need something different then please supply examples of output data based on the data in your question along with values for earliest and latest week.
HTH
data have;
input
state : $char20.
year : 8.
week : 8.
rainfall : 8.
;
datalines;
Illinois 2020 1 2
Illinois 2021 2 5
Illinois 2021 4 6
Missouri 2021 3 8
Missouri 2021 5 1
Missouri 2022 6 3
Missouri 2022 7 2
Wisconsin 2021 1 1
Wisconsin 2021 4 3
Wisconsin 2022 6 2
;
%let earliest_week = 2;
%let current_week = 5;
data want;
set have(where = (week between &earliest_week and ¤t_week));
by state year;
if first.year then
week_counter = &earliest_week;
if week_counter lt week then
do;
original_week = week;
original_rainfall = rainfall;
/* rainfall = 0; */
call missing(rainfall);
do while (week_counter lt original_week);
week = week_counter;
output;
week_counter + 1;
end;
week = original_week;
rainfall = original_rainfall;
end;
output;
week_counter + 1;
if last.year and week lt ¤t_week then
do;
original_week = week;
original_rainfall = rainfall;
/* rainfall = 0; */
call missing(rainfall);
do while (week_counter le ¤t_week);
week = week_counter;
output;
week_counter + 1;
end;
end;
run;
Kind regards,
Amir.
Edit: Remove diagnostic statement from code.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The example data I gave isn't great, sorry. I want it to start mid-2020 and extend until the current week.
So earliest.week would be like 20 and current.week is 10
There should be weeks 20-52 for 2020 for all states, all 52 weeks for 2021, and up until the current week for 2022. The dataset will be quite large for 52 weeks for each state so its hard to share the full example.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Create a dynamic template for all states and weeks:
data states;
input state $30.;
datalines;
Alabama
Alaska
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Jersey
New Hampshire
New Mexico
New York
North Carolina
North Dakota
Ohio
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Tennessee
Texas
Utah
Vermont
Virginia
Washington
West Virginia
Wisconsin
Wyoming
;
%let begin=%sysevalf("17may2020"d);
data weeks;
date = &begin.;
do while (date lt today());
year = year(date);
week = week(date);
output;
date = date + 7;
end;
drop date;
run;
proc sql;
create table template as
select
state,
year,
week
from states, weeks
order by state, year, week
;
quit;
And then check your dataset against that:
proc sort data=have;
by state year week;
run;
data
want
miss
;
merge
have (in=h)
template (in=t)
;
by state year week;
if not h
then output miss;
output want; /* creates an observation with missing rainfall if not present in have */
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Please supply usable example data (see @Ksharp 's code) that illustrates the issue.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I created an example dataset:
data have;
infile datalines dlm="," dsd;
input state :$30. year week rainfall;
datalines;
Alabama,2021,35,10
Alabama,2021,36,11
Alabama,2021,37,5
Alabama,2021,38,9
Alabama,2021,39,8
Alabama,2021,40,7
;
Then I ran my code against it, and printed a subset of want to show "Alabama" and the relevant weeks, which yields this result:
Beob. state year week rainfall 64 Alabama 2021 31 . 65 Alabama 2021 32 . 66 Alabama 2021 33 . 67 Alabama 2021 34 . 68 Alabama 2021 35 10 69 Alabama 2021 36 11 70 Alabama 2021 37 5 71 Alabama 2021 38 9 72 Alabama 2021 39 8 73 Alabama 2021 40 7 74 Alabama 2021 41 . 75 Alabama 2021 42 . 76 Alabama 2021 43 . 77 Alabama 2021 44 . 78 Alabama 2021 45 . 79 Alabama 2021 46 . 80 Alabama 2021 47 . 81 Alabama 2021 48 . 82 Alabama 2021 49 . 83 Alabama 2021 50 . 84 Alabama 2021 51 . 85 Alabama 2021 52 .
As you can see, none of the values from have have been changed in any way.
So it is ESSENTIAL to see your data as is. We need to know exactly what the types of the variables are, what the raw values are, and which formats (if present) are used to display them.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
When using this coding pattern place TEMPLATE before the real dataset in the MERGE statement.
That will prevent any extra variables that might have accidentally been added to the TEMPLATE dataset from overwriting the values read from the real dataset.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data have;
infile datalines truncover;
input State $ Year Week Rainfall;
datalines;
Illinois 2020 1 2
Illinois 2021 2 5
Illinois 2021 4 6
Missouri 2021 3 8
Missouri 2021 5 1
Missouri 2022 6 3
Missouri 2022 7 2
Wisconsin 2021 1 1
Wisconsin 2021 4 3
Wisconsin 2022 6 2
;
proc freq data=have noprint;
table State*Year*Week/out=want sparse;
weight Rainfall;
run;