Solved: Re: Avoid observations being counted twice in PROC TABULATE

halkyos · Posted 10-16-2019 11:09 AM

I am working with some 2015-2017 FARS data to analyze traffic deaths by county. One of the subjects of interest is fatalities which occurred in a crash where at least one driver had a BAC above the legal limit. My code works well when an accident only has one driver over the limit, but when there is more than one the fatalities start getting counted multiple times. The following code example sets up some fake data with the actual code I am using in PROC TABULATE (aside from the dataset name and the title):

DATA test;
	input casenum persnum numfatal county $ driver alcdriver;
datalines;
1 1 2 A 1 1
1 2 2 A 1 0
1 3 2 A 1 1
1 4 2 A 0 0
1 5 2 A 0 0
2 1 3 B 1 1
2 2 3 B 0 0
2 3 3 B 0 0
3 1 1 B 1 1
3 2 1 B 1 1
4 1 1 B 1 0
;
RUN;
PROC TABULATE DATA=test;
	TITLE 'TEST';
	CLASS county;
	VAR numfatal;
	TABLE county, SUM*(numfatal);
	WHERE alcdriver=1;
RUN;

The variables are the same variable names I am using in the actual data and they represent the following:

casenum- A sequential ordering of crashes. This is the same for each person involved in the crash.
persnum- A sequential ordering of each person per crash. This begins at 1 and resets for each crash.
numfatal- The number of fatalities in the crash, this is the same for each person involved in the crash.
county- The county where the crash occurred.
driver- if the person was a driver (1) or not (0).
alcdriver- if the person was a driver AND over the legal limit (1) or not (0).

So looking at the data (not the PROC TABULATE results) we see that county A had one crash (casenum 1) involving five people, two of whom were drunk drivers, resulting in 2 fatalities. County B had 3 crashes: Casenum 2 involved three people with only one drunk driver and three deaths. Casenum 3 involved two people, both drunk drivers, and one death. Casenum 4 had a death, but the driver wasn't drunk.

What I should be seeing as an result of summarizing the number of fatalities by county where alcdriver=1 is County A: 2 and County B: 4

What I am seeing is:

So it appears that the deaths are getting counted for each drunk driver instead of only once per case. How do I fix this?

-------------------------------------------------------------------------

Here is my log:

1
2 DATA test;
3 input casenum persnum numfatal county $ driver alcdriver;
4 datalines;

NOTE: The data set WORK.TEST has 11 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.03 seconds


16 ;
17 RUN;
18 PROC TABULATE DATA=test;
NOTE: Writing HTML Body file: sashtml.htm
19 TITLE 'TEST';
20 CLASS county;
21 VAR numfatal;
22 TABLE county, SUM*(numfatal);
23 WHERE alcdriver=1;
24 RUN;

NOTE: There were 5 observations read from the data set WORK.TEST.
WHERE alcdriver=1;
NOTE: PROCEDURE TABULATE used (Total process time):
real time 0.50 seconds
cpu time 0.40 seconds

halkyos · Posted 10-16-2019 12:32 PM

I actually just figured it out. Since the PROC SQL step gave me a data set where each observation was identical and had the total number of fatalities, people involved, and number of drivers over the legal limit, I was able to take that data set and remove duplicates. Below is the full code. This probably could have been done more efficiently, but it does what I need it to do.

DATA test;
	input casenum persnum numfatal county $ driver alcdriver;
datalines;
1 1 2 A 1 1
1 2 2 A 1 0
1 3 2 A 1 1
1 4 2 A 0 0
1 5 2 A 0 0
2 1 3 B 1 1
2 2 3 B 0 0
2 3 3 B 0 0
3 1 1 B 1 1
3 2 1 B 1 1
4 1 1 B 1 0
;
RUN;
PROC TABULATE DATA=test;
	TITLE 'TEST';
	CLASS county;
	VAR numfatal;
	TABLE county, SUM*(numfatal);
	WHERE alcdriver=1;
RUN;
PROC SQL;
	CREATE TABLE test2 as
	SELECT casenum as casenum,
	max(persnum) as persnum,
	max(numfatal) as numfatal,
	county as county,
	sum(alcdriver) as alcdriver
	FROM test
	GROUP BY casenum;
RUN;
QUIT;
PROC SORT DATA=test2;
	BY casenum;
RUN;
DATA test3;
	SET test2;
	BY casenum;
		IF first.casenum;
RUN;
PROC TABULATE DATA=test3;
	TITLE 'TEST 2';
	CLASS county;
	VAR numfatal;
	TABLE county, SUM*(numfatal);
	WHERE alcdriver GT 0;
RUN;

View solution in original post

ballardw · Posted 10-16-2019 11:53 AM

@halkyos wrote:

I am working with some 2015-2017 FARS data to analyze traffic deaths by county. One of the subjects of interest is fatalities which occurred in a crash where at least one driver had a BAC above the legal limit. My code works well when an accident only has one driver over the limit, but when there is more than one the fatalities start getting counted multiple times. The following code example sets up some fake data with the actual code I am using in PROC TABULATE (aside from the dataset name and the title):

The variables are the same variable names I am using in the actual data and they represent the following:

casenum- A sequential ordering of crashes. This is the same for each person involved in the crash.

persnum- A sequential ordering of each person per crash. This begins at 1 and resets for each crash.

numfatal- The number of fatalities in the crash, this is the same for each person involved in the crash.

county- The county where the crash occurred.

driver- if the person was a driver (1) or not (0).

alcdriver- if the person was a driver AND over the legal limit (1) or not (0).

Your data is structured incorrectly. If you have a record for each person you should have an indicator that the specific person was killed and each record should have the "any driver" alcohol indicator.

Or only have one record per car and the number of fatalities in that car.

And since you are dealing with a per casenum for determining "any driver" with BAC above limit then ALL of the records for any crash should have the same Alcdriver value. Your Alcdriver coding seems to be going toward the "per car" approach but since you have the fatalities counted in multiple cars you need to go back a step for how the numfatal were assigned.

halkyos · Posted 10-16-2019 12:20 PM

How do you suggest I correct this? As far as the FARS data itself goes, it is a dataset from the National Highway Traffic Safety Administration so the initial structure is out of my control. The only variable that was created after the fact is AlcDriver (and further on will include one for drivers who tested positive for other substances). It isn't possible to identify which of the people in the crash was the fatality (well it is, but that requires reading several hundred police reports when dealing with the full data).

I tried the following, but wound up with A=10 and B=11:

PROC SQL;
	CREATE TABLE test2 as
	SELECT casenum as casenum,
	max(persnum) as persnum,
	max(numfatal) as numfatal,
	county as county,
	sum(alcdriver) as alcdriver
	FROM test
	GROUP BY casenum;
RUN;
QUIT;
PROC TABULATE DATA=test2;
	TITLE 'TEST 2';
	CLASS county;
	VAR numfatal;
	TABLE county, SUM*(numfatal);
	WHERE alcdriver GT 0;
RUN;

halkyos · Posted 10-16-2019 12:32 PM

I actually just figured it out. Since the PROC SQL step gave me a data set where each observation was identical and had the total number of fatalities, people involved, and number of drivers over the legal limit, I was able to take that data set and remove duplicates. Below is the full code. This probably could have been done more efficiently, but it does what I need it to do.

DATA test;
	input casenum persnum numfatal county $ driver alcdriver;
datalines;
1 1 2 A 1 1
1 2 2 A 1 0
1 3 2 A 1 1
1 4 2 A 0 0
1 5 2 A 0 0
2 1 3 B 1 1
2 2 3 B 0 0
2 3 3 B 0 0
3 1 1 B 1 1
3 2 1 B 1 1
4 1 1 B 1 0
;
RUN;
PROC TABULATE DATA=test;
	TITLE 'TEST';
	CLASS county;
	VAR numfatal;
	TABLE county, SUM*(numfatal);
	WHERE alcdriver=1;
RUN;
PROC SQL;
	CREATE TABLE test2 as
	SELECT casenum as casenum,
	max(persnum) as persnum,
	max(numfatal) as numfatal,
	county as county,
	sum(alcdriver) as alcdriver
	FROM test
	GROUP BY casenum;
RUN;
QUIT;
PROC SORT DATA=test2;
	BY casenum;
RUN;
DATA test3;
	SET test2;
	BY casenum;
		IF first.casenum;
RUN;
PROC TABULATE DATA=test3;
	TITLE 'TEST 2';
	CLASS county;
	VAR numfatal;
	TABLE county, SUM*(numfatal);
	WHERE alcdriver GT 0;
RUN;

ballardw · Posted 10-16-2019 12:55 PM

So you created a single record per accident solution.

I'm glad that you found a solution.

Note that you have one very unlikely data case to consider, and may not ever occur in your data but for learning purposes is worth considering.

Consider a crash that occurs on a county boundary. It is not impossible that a recording officer may indicate that one car is in one county and the other in a different county. This scenario is not completely unlikely in the case of multiple car pileups such as winter driving conditions where you might have quite a few vehicles in the same "crash".

So you may need to consider some logic change to:

PROC SQL;
	CREATE TABLE test2 as
	SELECT casenum as casenum,
	max(persnum) as persnum,
	max(numfatal) as numfatal,
	county as county,
	sum(alcdriver) as alcdriver
	FROM test
	GROUP BY casenum;

QUIT;

as to exactly which county you get with that code in the multiple vehicle scenarios.

ballardw · Posted 10-16-2019 01:23 PM

@halkyos wrote:

How do you suggest I correct this? As far as the FARS data itself goes, it is a dataset from the National Highway Traffic Safety Administration so the initial structure is out of my control. The only variable that was created after the fact is AlcDriver (and further on will include one for drivers who tested positive for other substances). It isn't possible to identify which of the people in the crash was the fatality (well it is, but that requires reading several hundred police reports when dealing with the full data).

I tried the following, but wound up with A=10 and B=11:

When I was working with traffic crash statistics it was from data read from the tapes (yes tapes) provided by the state police and we had separate files for an overall crash summaray, per vehicle and per person. So we tended to work on the level of data needed to answer an question and combine fields based on the crash and/vehicle info as needed.

I suspect this may have predated the FARS data structures currently available.

General approach is to summarize as needed by crash or perhaps vehicle. which you have determined one way.

Avoid observations being counted twice in PROC TABULATE

Re: Avoid observations being counted twice in PROC TABULATE

Re: Avoid observations being counted twice in PROC TABULATE

Re: Avoid observations being counted twice in PROC TABULATE

Re: Avoid observations being counted twice in PROC TABULATE

Re: Avoid observations being counted twice in PROC TABULATE

Re: Avoid observations being counted twice in PROC TABULATE

Catch up on SAS Innovate 2026