Re: Time-series data with time gaps that need to be filled

DID · Posted 05-26-2021 12:15 PM

Hello. I've been researching a solution to this question for a couple of days and I haven't found anything as specific as what I need. I'm not a succinct coder, and I am somewhere between novice and intermediate. I'd really appreciate help with the following:

I have a dataset that should have 7800 observations (25 census tracts x 26 years x 12 month/years).

It has 5,704 observations because there are month/years where--in the original crime dataset-- the crime count for that month/yr and tract was 0. (The counts here are total crime including different UCR hierarchy codes.) The sample below has 7 sample rows. Everything to the right of crimecount is a demographic variable from the US census.

The census data is from three decennial censuses-1990 population data (trctpop) is applied from 1990-1994, 2000 data is applied from 1995-2004, and 2010 data is applied from 2005-2015.

I want to fill in the missing month/years with all of the identical census population data and demographic information for the appropriate month/year and census tract, and add a 0 for crimecount. So for example, between March and May 1990 below, a new observation would be added, identical to these two observations, except the month/year would say April1990 and there would be a 0 in the crimecount column.

As I began coding the dataset below, I wanted month and year, so o_moyr in the table is formatted as dtmonyy7. The problem is that SAS "sees" the day in the date, and I couldn't drop duplicate month/yrs because SAS wasn't just looking at the month and year, but also the day. So I created o_monyrc as a categorical version of o_monyrc and used the nodupkey option to get SAS to see beyond the day to drop duplicate month/years. The problem, of course is that SAS orders o_monyrc starting with April, so I have to sort by o_moyr to keep the dataset ordered properly. As I've tried to figure out adding in the missing month/years and 0's for crimecount, the date formatting issue continues to be a problem because SAS still sees the day in o_moyr. I'd also appreciate suggestions for how to deal with that. Perhaps I could extract the month and the year into different columns from the get go and drop o_monyrc?

Thank you for any and all help.

o_monyrc	o_moyr	tract	trctpop	crimecount	shrblk	numhhs	fmcn	ffhn	ffhd	educ8	educ11
JAN1990	JAN1990	305	2583	8	2493	1133	44	191	257	295	541
FEB1990	FEB1990	305	2583	4	2493	1133	44	191	257	295	541
MAR1990	MAR1990	305	2583	1	2493	1133	44	191	257	295	541
MAY1990	MAY1990	305	2583	3	2493	1133	44	191	257	295	541
JUN1990	JUN1990	305	2583	11	2493	1133	44	191	257	295	541
JUL1990	JUL1990	305	2583	7	2493	1133	44	191	257	295	541
AUG1990	AUG1990	305	2583	7	2493	1133	44	191	257	295	541

jimbarbour · Posted 05-26-2021 12:22 PM

I'm not sure if I fully understand (yet), but have you tried the INTNX function with the 'sameday' option? With the 'sameday' option, you can increment forward (or back), one month at a time, and the day will remain constant (test carefully if you use 28, 29, 30, or 31 as your day).

Jim

DID · Posted 05-26-2021 12:37 PM

Thank you Jim. How do I know which day SAS is seeing to know which "sameday" SAS is using? When I used the nodupkey option to drop duplicate categorical month/days in order to get the dataset the sample comes from, I don't know which of the days it dropped from among the numerical month/yrs. Here's some of the original code, if that helps.

*Turn o_moyr into a character variable to aid in grouping crime count;
data crime.crime3;
set crime.crime2;
o_monyrc = put(o_year, dtmonyy7.);
run;

(lots of code...)

data main.dataset5_a ;
set main.dataset5;
where tract in (305, 402,403,501,506,708,902,1005,1017,1106,1113,1114,1115,1203,
1204,1207,1301,1302,1304, 1404, 1405, 1603, 1607, 2807, 2902);
run;
*19043 observations, 57 variables;

proc sort data=main.dataset5_a out=main.dataset5_a2 nodupkey;
by tract o_monyrc;
run;
*5704 observations, 57 variables;

proc sort data=main.dataset5_a2;
by tract o_moyr;
run;

jimbarbour · Posted 05-26-2021 01:41 PM

@DID wrote:
Thank you Jim. How do I know which day SAS is seeing to know which "sameday" SAS is using? When I used the nodupkey option to drop duplicate categorical month/days in order to get the dataset the sample comes from, I don't know which of the days it dropped from among the numerical month/yrs.

What I would do is extract the months and years and create a new column that is a combination of year and month. I'm going to put some code below, and you can see me do that. Then do your Sort with a Nodupkey. After you've done your Sort Nodupkey, you don't care which day it is (again, see code, below), because we're going to

Extract the year and month
Concatenate the year and month into a YYMM field
Compare the YYMM from the previous row with what we actually get with the current row. When done this way, the day of the month does not matter.

Results:

Caveats: My code is not all of what you need, but you should be able to learn from it and be able to tweak it so that is exactly what you need. For example, this code will only catch one missing month. You'd actually have to create a loop for the case that you need to create more than one missing month. For example, if you have Jan, Feb, Mar, and May, this code will fill in Apr for you. But if you have Jan, Feb, Mar, and Jun, this code would fill in Apr but miss May. You'll need to adjust accordingly.

Take a look at the code. Try some of the techniques I'm giving you examples of, and then let's discuss further when questions or problems arise.

Jim

DATA	Raw_Months;
	DROP	_:;

	RETAIN	_First_Time	1;

/*	FORMAT	o_monyrc		DATE9.*/
/*			o_moyr			DATE9.*/
/*			_Desired_Date	DATE9.*/
/*			_prev_o_monyrc	DATE9.*/
/*			_prev_o_moyr	DATE9.*/
/*			_save_o_monyrc	DATE9.*/
/*			_save_o_moyr	DATE9.*/
/*			;*/
	FORMAT	o_monyrc		MONYY7.
			o_moyr			MONYY7.
			_Desired_Date	MONYY7.
			_prev_o_monyrc	MONYY7.
			_prev_o_moyr	MONYY7.
			_save_o_monyrc	MONYY7.
			_save_o_moyr	MONYY7.
			;

	LENGTH
		_o_monyrc			$7
		_o_moyr				$7
		tract				$3
		trctpop				$4
		crimecount			$2
		shrblk				$4
		numhhs				$4
		fmcn				$2
		ffhn				$3
		ffhd				$3
		educ8				$3
		educ11				$3
		;

	LENGTH
		_prev_o_monyrc		8.
		_prev_o_moyr		8.
		_prev_tract			$3
		_prev_trctpop		$4
		_prev_crimecount	$2
		_prev_shrblk		$4
		_prev_numhhs		$4
		_prev_fmcn			$2
		_prev_ffhn			$3
		_prev_ffhd			$3
		_prev_educ8			$3
		_prev_educ11		$3
		;

	LENGTH
		_save_o_monyrc		8.
		_save_o_moyr		8.
		_save_tract			$3
		_save_trctpop		$4
		_save_crimecount	$2
		_save_shrblk		$4
		_save_numhhs		$4
		_save_fmcn			$2
		_save_ffhn			$3
		_save_ffhd			$3
		_save_educ8			$3
		_save_educ11		$3
		;

	RETAIN
		_prev_o_monyrc	
		_prev_o_moyr		
		_prev_tract		
		_prev_trctpop	
		_prev_crimecount
		_prev_shrblk	
		_prev_numhhs	
		_prev_fmcn		
		_prev_ffhn		
		_prev_ffhd		
		_prev_educ8		
		_prev_educ11	
		;

	INFILE	DATALINES4	DSD	DLM='09'X;
	INPUT
		_o_monyrc			$
		_o_moyr				$
		tract				$
		trctpop				$
		crimecount			$
		shrblk				$
		numhhs				$
		fmcn				$
		ffhn				$
		ffhd				$
		educ8				$
		educ11				$
		;

	o_monyrc				=	INPUT(_o_monyrc,	ANYDTDTE7.);
	o_moyr					=	INPUT(_o_moyr,		ANYDTDTE7.);

	IF	_First_Time								THEN
		DO;
			PUTLOG	"&Nte1  Setting _First_Time to 0";
			_First_Time		=	0;
		END;
	ELSE
		DO;
			_Desired_Date	=	INTNX('MONTH', _prev_o_monyrc, 1, 'SAMEDAY');
			_Desired_Month	=	MONTH(_Desired_Date);
			_Desired_Year	=	YEAR(_Desired_Date);
			_Actual_Month	=	MONTH(o_monyrc);
			_Actual_Year	=	YEAR(o_monyrc);
			_Actual_YYMM	=	(_Actual_Year  * 100)	+	_Actual_Month;
			_Desired_YYMM	=	(_Desired_Year * 100)	+	_Desired_Month;
			PUTLOG	"&Nte1  Desired Year is  "	_Desired_Year;
			PUTLOG	"&Nte1  Desired Month is "	_Desired_Month;
			PUTLOG	"&Nte1  Desired YYMM is  "	_Desired_YYMM;
			PUTLOG	"&Nte1  Actual Year is   "	_Actual_Year;
			PUTLOG	"&Nte1  Actual Month is  "	_Actual_Month;
			PUTLOG	"&Nte1  Actual YYMM is   "	_Actual_YYMM;
			IF	_Actual_YYMM				>	_Desired_YYMM	THEN
				DO;
					PUTLOG	"&Nte1  Adding a missing month "	_Desired_Month=;
					_save_o_monyrc			=	o_monyrc			;
					_save_o_moyr			=	o_moyr				;
					_save_tract				=	tract				;
					_save_trctpop			=	trctpop				;
					_save_crimecount		=	crimecount			;
					_save_shrblk			=	shrblk				;
					_save_numhhs			=	numhhs				;
					_save_fmcn				=	fmcn				;
					_save_ffhn				=	ffhn				;
					_save_ffhd				=	ffhd				;
					_save_educ8				=	educ8				;
					_save_educ11			=	educ11				;
					o_monyrc				=	_Desired_Date		;
					o_moyr					=	_Desired_Date		;
					tract					=	_prev_tract			;
					trctpop					=	_prev_trctpop		;
					crimecount				=	'0'					;
					shrblk					=	_prev_shrblk		;
					numhhs					=	_prev_numhhs		;
					fmcn					=	_prev_fmcn			;
					ffhn					=	_prev_ffhn			;
					ffhd					=	_prev_ffhd			;
					educ8					=	_prev_educ8			;
					educ11					=	_prev_educ11		;
					OUTPUT;
					o_monyrc				=	_save_o_monyrc		;
					o_moyr					=	_save_o_moyr		;
					tract					=	_save_tract			;
					trctpop					=	_save_trctpop		;
					crimecount				=	_save_crimecount	;
					shrblk					=	_save_shrblk		;
					numhhs					=	_save_numhhs		;
					fmcn					=	_save_fmcn			;
					ffhn					=	_save_ffhn			;
					ffhd					=	_save_ffhd			;
					educ8					=	_save_educ8			;
					educ11					=	_save_educ11		;
				END;
		END;

	_prev_o_monyrc			=	o_monyrc	;
	_prev_o_moyr			=	o_moyr		;
	_prev_tract				=	tract		;
	_prev_trctpop			=	trctpop		;
	_prev_crimecount		=	crimecount	;
	_prev_shrblk			=	shrblk		;
	_prev_numhhs			=	numhhs		;
	_prev_fmcn				=	fmcn		;
	_prev_ffhn				=	ffhn		;
	_prev_ffhd				=	ffhd		;
	_prev_educ8				=	educ8		;
	_prev_educ11			=	educ11		;

	OUTPUT;

DATALINES4;
JAN1990	JAN1990	305	2583	8	2493	1133	44	191	257	295	541
FEB1990	FEB1990	305	2583	4	2493	1133	44	191	257	295	541
MAR1990	MAR1990	305	2583	1	2493	1133	44	191	257	295	541
MAY1990	MAY1990	305	2583	3	2493	1133	44	191	257	295	541
JUN1990	JUN1990	305	2583	11	2493	1133	44	191	257	295	541
JUL1990	JUL1990	305	2583	7	2493	1133	44	191	257	295	541
AUG1990	AUG1990	305	2583	7	2493	1133	44	191	257	295	541
;;;;
RUN;

ballardw · Posted 05-26-2021 01:02 PM

What does the census data have for the Date component?

And are you actually using one Census data set annual value to 10 years of your data? I would look around the census website for one of the other sets that has annual estimates.

This might be as easy as expanding a yearly "date" to match your dates in the crime data.

An example assuming you have a date in the crime data that represents the first day of the month.

data monthdata;
   do year= 1990 to 2015;
      do month=1 to 12;
         date= mdy(month,1,year);
         /* simulate a value for a variable*/
         crimevar = rand('integer',1000);
         /* simulate missing date*/
         if rand('integer',10) le 9 then output;
      end;
   end;
   format date  monyy7.;
   drop month year;
run;

data censusdata;
   /* assumes one value per data year*/
   input cyear somevalue;
datalines;
1990  1234
2000  3456
2010  4567
;

data censusmerge;
   set censusdata;
   select (cyear);
      when(1990) do;
                  syear=1990;
                  eyear=1994;
                 end;
      when(2000) do;
                  syear=1995;
                  eyear=2004;
                 end;
      when(2010) do;
                  syear=2005;
                  eyear=2015;
                 end;
      otherwise;
   end;
   do year=syear to eyear;
      do month=1 to 12;
         date=mdy(month,1,year);
         output;
      end;
   end;
   format date monyy7.;
   drop month year syear eyear;
run;

data combined;
   merge monthdata censusmerge;
   by date;
   if missing(crimevar) then crimevar=0;
run;

Your data sets would have the Census Tract information in both the "monthdata" and census data and the BY statement in the last step would use By CensusTract Date; (or Date Censustract, you likely need to sort both sets before merging.

Time-series data with time gaps that need to be filled