Hello. I've been researching a solution to this question for a couple of days and I haven't found anything as specific as what I need. I'm not a succinct coder, and I am somewhere between novice and intermediate. I'd really appreciate help with the following:
I have a dataset that should have 7800 observations (25 census tracts x 26 years x 12 month/years).
It has 5,704 observations because there are month/years where--in the original crime dataset-- the crime count for that month/yr and tract was 0. (The counts here are total crime including different UCR hierarchy codes.) The sample below has 7 sample rows. Everything to the right of crimecount is a demographic variable from the US census.
The census data is from three decennial censuses-1990 population data (trctpop) is applied from 1990-1994, 2000 data is applied from 1995-2004, and 2010 data is applied from 2005-2015.
I want to fill in the missing month/years with all of the identical census population data and demographic information for the appropriate month/year and census tract, and add a 0 for crimecount. So for example, between March and May 1990 below, a new observation would be added, identical to these two observations, except the month/year would say April1990 and there would be a 0 in the crimecount column.
As I began coding the dataset below, I wanted month and year, so o_moyr in the table is formatted as dtmonyy7. The problem is that SAS "sees" the day in the date, and I couldn't drop duplicate month/yrs because SAS wasn't just looking at the month and year, but also the day. So I created o_monyrc as a categorical version of o_monyrc and used the nodupkey option to get SAS to see beyond the day to drop duplicate month/years. The problem, of course is that SAS orders o_monyrc starting with April, so I have to sort by o_moyr to keep the dataset ordered properly. As I've tried to figure out adding in the missing month/years and 0's for crimecount, the date formatting issue continues to be a problem because SAS still sees the day in o_moyr. I'd also appreciate suggestions for how to deal with that. Perhaps I could extract the month and the year into different columns from the get go and drop o_monyrc?
Thank you for any and all help.
o_monyrc | o_moyr | tract | trctpop | crimecount | shrblk | numhhs | fmcn | ffhn | ffhd | educ8 | educ11 |
JAN1990 | JAN1990 | 305 | 2583 | 8 | 2493 | 1133 | 44 | 191 | 257 | 295 | 541 |
FEB1990 | FEB1990 | 305 | 2583 | 4 | 2493 | 1133 | 44 | 191 | 257 | 295 | 541 |
MAR1990 | MAR1990 | 305 | 2583 | 1 | 2493 | 1133 | 44 | 191 | 257 | 295 | 541 |
MAY1990 | MAY1990 | 305 | 2583 | 3 | 2493 | 1133 | 44 | 191 | 257 | 295 | 541 |
JUN1990 | JUN1990 | 305 | 2583 | 11 | 2493 | 1133 | 44 | 191 | 257 | 295 | 541 |
JUL1990 | JUL1990 | 305 | 2583 | 7 | 2493 | 1133 | 44 | 191 | 257 | 295 | 541 |
AUG1990 | AUG1990 | 305 | 2583 | 7 | 2493 | 1133 | 44 | 191 | 257 | 295 | 541 |
I'm not sure if I fully understand (yet), but have you tried the INTNX function with the 'sameday' option? With the 'sameday' option, you can increment forward (or back), one month at a time, and the day will remain constant (test carefully if you use 28, 29, 30, or 31 as your day).
Jim
@DID wrote:
Thank you Jim. How do I know which day SAS is seeing to know which "sameday" SAS is using? When I used the nodupkey option to drop duplicate categorical month/days in order to get the dataset the sample comes from, I don't know which of the days it dropped from among the numerical month/yrs.
What I would do is extract the months and years and create a new column that is a combination of year and month. I'm going to put some code below, and you can see me do that. Then do your Sort with a Nodupkey. After you've done your Sort Nodupkey, you don't care which day it is (again, see code, below), because we're going to
Results:
Caveats: My code is not all of what you need, but you should be able to learn from it and be able to tweak it so that is exactly what you need. For example, this code will only catch one missing month. You'd actually have to create a loop for the case that you need to create more than one missing month. For example, if you have Jan, Feb, Mar, and May, this code will fill in Apr for you. But if you have Jan, Feb, Mar, and Jun, this code would fill in Apr but miss May. You'll need to adjust accordingly.
Take a look at the code. Try some of the techniques I'm giving you examples of, and then let's discuss further when questions or problems arise.
Jim
DATA Raw_Months;
DROP _:;
RETAIN _First_Time 1;
/* FORMAT o_monyrc DATE9.*/
/* o_moyr DATE9.*/
/* _Desired_Date DATE9.*/
/* _prev_o_monyrc DATE9.*/
/* _prev_o_moyr DATE9.*/
/* _save_o_monyrc DATE9.*/
/* _save_o_moyr DATE9.*/
/* ;*/
FORMAT o_monyrc MONYY7.
o_moyr MONYY7.
_Desired_Date MONYY7.
_prev_o_monyrc MONYY7.
_prev_o_moyr MONYY7.
_save_o_monyrc MONYY7.
_save_o_moyr MONYY7.
;
LENGTH
_o_monyrc $7
_o_moyr $7
tract $3
trctpop $4
crimecount $2
shrblk $4
numhhs $4
fmcn $2
ffhn $3
ffhd $3
educ8 $3
educ11 $3
;
LENGTH
_prev_o_monyrc 8.
_prev_o_moyr 8.
_prev_tract $3
_prev_trctpop $4
_prev_crimecount $2
_prev_shrblk $4
_prev_numhhs $4
_prev_fmcn $2
_prev_ffhn $3
_prev_ffhd $3
_prev_educ8 $3
_prev_educ11 $3
;
LENGTH
_save_o_monyrc 8.
_save_o_moyr 8.
_save_tract $3
_save_trctpop $4
_save_crimecount $2
_save_shrblk $4
_save_numhhs $4
_save_fmcn $2
_save_ffhn $3
_save_ffhd $3
_save_educ8 $3
_save_educ11 $3
;
RETAIN
_prev_o_monyrc
_prev_o_moyr
_prev_tract
_prev_trctpop
_prev_crimecount
_prev_shrblk
_prev_numhhs
_prev_fmcn
_prev_ffhn
_prev_ffhd
_prev_educ8
_prev_educ11
;
INFILE DATALINES4 DSD DLM='09'X;
INPUT
_o_monyrc $
_o_moyr $
tract $
trctpop $
crimecount $
shrblk $
numhhs $
fmcn $
ffhn $
ffhd $
educ8 $
educ11 $
;
o_monyrc = INPUT(_o_monyrc, ANYDTDTE7.);
o_moyr = INPUT(_o_moyr, ANYDTDTE7.);
IF _First_Time THEN
DO;
PUTLOG "&Nte1 Setting _First_Time to 0";
_First_Time = 0;
END;
ELSE
DO;
_Desired_Date = INTNX('MONTH', _prev_o_monyrc, 1, 'SAMEDAY');
_Desired_Month = MONTH(_Desired_Date);
_Desired_Year = YEAR(_Desired_Date);
_Actual_Month = MONTH(o_monyrc);
_Actual_Year = YEAR(o_monyrc);
_Actual_YYMM = (_Actual_Year * 100) + _Actual_Month;
_Desired_YYMM = (_Desired_Year * 100) + _Desired_Month;
PUTLOG "&Nte1 Desired Year is " _Desired_Year;
PUTLOG "&Nte1 Desired Month is " _Desired_Month;
PUTLOG "&Nte1 Desired YYMM is " _Desired_YYMM;
PUTLOG "&Nte1 Actual Year is " _Actual_Year;
PUTLOG "&Nte1 Actual Month is " _Actual_Month;
PUTLOG "&Nte1 Actual YYMM is " _Actual_YYMM;
IF _Actual_YYMM > _Desired_YYMM THEN
DO;
PUTLOG "&Nte1 Adding a missing month " _Desired_Month=;
_save_o_monyrc = o_monyrc ;
_save_o_moyr = o_moyr ;
_save_tract = tract ;
_save_trctpop = trctpop ;
_save_crimecount = crimecount ;
_save_shrblk = shrblk ;
_save_numhhs = numhhs ;
_save_fmcn = fmcn ;
_save_ffhn = ffhn ;
_save_ffhd = ffhd ;
_save_educ8 = educ8 ;
_save_educ11 = educ11 ;
o_monyrc = _Desired_Date ;
o_moyr = _Desired_Date ;
tract = _prev_tract ;
trctpop = _prev_trctpop ;
crimecount = '0' ;
shrblk = _prev_shrblk ;
numhhs = _prev_numhhs ;
fmcn = _prev_fmcn ;
ffhn = _prev_ffhn ;
ffhd = _prev_ffhd ;
educ8 = _prev_educ8 ;
educ11 = _prev_educ11 ;
OUTPUT;
o_monyrc = _save_o_monyrc ;
o_moyr = _save_o_moyr ;
tract = _save_tract ;
trctpop = _save_trctpop ;
crimecount = _save_crimecount ;
shrblk = _save_shrblk ;
numhhs = _save_numhhs ;
fmcn = _save_fmcn ;
ffhn = _save_ffhn ;
ffhd = _save_ffhd ;
educ8 = _save_educ8 ;
educ11 = _save_educ11 ;
END;
END;
_prev_o_monyrc = o_monyrc ;
_prev_o_moyr = o_moyr ;
_prev_tract = tract ;
_prev_trctpop = trctpop ;
_prev_crimecount = crimecount ;
_prev_shrblk = shrblk ;
_prev_numhhs = numhhs ;
_prev_fmcn = fmcn ;
_prev_ffhn = ffhn ;
_prev_ffhd = ffhd ;
_prev_educ8 = educ8 ;
_prev_educ11 = educ11 ;
OUTPUT;
DATALINES4;
JAN1990 JAN1990 305 2583 8 2493 1133 44 191 257 295 541
FEB1990 FEB1990 305 2583 4 2493 1133 44 191 257 295 541
MAR1990 MAR1990 305 2583 1 2493 1133 44 191 257 295 541
MAY1990 MAY1990 305 2583 3 2493 1133 44 191 257 295 541
JUN1990 JUN1990 305 2583 11 2493 1133 44 191 257 295 541
JUL1990 JUL1990 305 2583 7 2493 1133 44 191 257 295 541
AUG1990 AUG1990 305 2583 7 2493 1133 44 191 257 295 541
;;;;
RUN;
What does the census data have for the Date component?
And are you actually using one Census data set annual value to 10 years of your data? I would look around the census website for one of the other sets that has annual estimates.
This might be as easy as expanding a yearly "date" to match your dates in the crime data.
An example assuming you have a date in the crime data that represents the first day of the month.
data monthdata; do year= 1990 to 2015; do month=1 to 12; date= mdy(month,1,year); /* simulate a value for a variable*/ crimevar = rand('integer',1000); /* simulate missing date*/ if rand('integer',10) le 9 then output; end; end; format date monyy7.; drop month year; run; data censusdata; /* assumes one value per data year*/ input cyear somevalue; datalines; 1990 1234 2000 3456 2010 4567 ; data censusmerge; set censusdata; select (cyear); when(1990) do; syear=1990; eyear=1994; end; when(2000) do; syear=1995; eyear=2004; end; when(2010) do; syear=2005; eyear=2015; end; otherwise; end; do year=syear to eyear; do month=1 to 12; date=mdy(month,1,year); output; end; end; format date monyy7.; drop month year syear eyear; run; data combined; merge monthdata censusmerge; by date; if missing(crimevar) then crimevar=0; run;
Your data sets would have the Census Tract information in both the "monthdata" and census data and the BY statement in the last step would use By CensusTract Date; (or Date Censustract, you likely need to sort both sets before merging.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.