About halkyos

halkyos · ‎10-16-2019

I actually just figured it out. Since the PROC SQL step gave me a data set where every observation within a case was identical and carried the total number of fatalities, the number of people involved, and the number of drivers over the legal limit, I was able to take that data set and remove the duplicates. Below is the full code. This probably could have been done more efficiently, but it does what I need it to do.

    DATA test;
        input casenum persnum numfatal county $ driver alcdriver;
        datalines;
    1 1 2 A 1 1
    1 2 2 A 1 0
    1 3 2 A 1 1
    1 4 2 A 0 0
    1 5 2 A 0 0
    2 1 3 B 1 1
    2 2 3 B 0 0
    2 3 3 B 0 0
    3 1 1 B 1 1
    3 2 1 B 1 1
    4 1 1 B 1 0
    ;
    RUN;

    /* Original table: fatalities are summed once per drunk driver */
    PROC TABULATE DATA=test;
        TITLE 'TEST';
        CLASS county;
        VAR numfatal;
        TABLE county, SUM*(numfatal);
        WHERE alcdriver=1;
    RUN;

    /* Case-level totals. county is not in GROUP BY, so the summary
       values are remerged onto every row of the case. PROC SQL runs
       each statement immediately, so it needs QUIT but no RUN. */
    PROC SQL;
        CREATE TABLE test2 as
        SELECT casenum as casenum,
               max(persnum) as persnum,
               max(numfatal) as numfatal,
               county as county,
               sum(alcdriver) as alcdriver
        FROM test
        GROUP BY casenum;
    QUIT;

    PROC SORT DATA=test2;
        BY casenum;
    RUN;

    /* Keep only the first row of each case to drop the duplicates */
    DATA test3;
        SET test2;
        BY casenum;
        IF first.casenum;
    RUN;

    PROC TABULATE DATA=test3;
        TITLE 'TEST 2';
        CLASS county;
        VAR numfatal;
        TABLE county, SUM*(numfatal);
        WHERE alcdriver GT 0;
    RUN;
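One possibly more direct route, sketched under the assumption that county is constant within a casenum (as it is in the test data): when every non-aggregated column appears in the GROUP BY clause, PROC SQL returns one row per crash, so the sort and first.casenum steps are not needed. The names crash and n_alcdrivers are invented for this illustration and do not come from the FARS data.

```sas
/* Assumes WORK.TEST exists as created by the DATA step above. */
PROC SQL;
    CREATE TABLE crash AS
    SELECT casenum,
           county,
           MAX(numfatal)  AS numfatal,      /* same value on every row of a crash */
           SUM(alcdriver) AS n_alcdrivers   /* hypothetical name: drunk drivers per crash */
    FROM test
    GROUP BY casenum, county                /* all non-aggregated columns: no remerge */
    HAVING SUM(alcdriver) > 0;              /* keep crashes with at least one drunk driver */
QUIT;

PROC TABULATE DATA=crash;
    TITLE 'TEST 2 (one row per crash)';
    CLASS county;
    VAR numfatal;
    TABLE county, SUM*(numfatal);
RUN;
```

With the test data this leaves cases 1, 2, and 3, which should give County A: 2 and County B: 4.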

halkyos · ‎10-16-2019

How do you suggest I correct this? As far as the FARS data itself goes, it is a dataset from the National Highway Traffic Safety Administration, so the initial structure is out of my control. The only variable that was created after the fact is AlcDriver (and further on there will be one for drivers who tested positive for other substances). It isn't possible to identify which of the people in the crash was the fatality (well, it is, but that requires reading several hundred police reports when dealing with the full data).

I tried the following, but wound up with A=10 and B=11:

    PROC SQL;
        CREATE TABLE test2 as
        SELECT casenum as casenum,
               max(persnum) as persnum,
               max(numfatal) as numfatal,
               county as county,
               sum(alcdriver) as alcdriver
        FROM test
        GROUP BY casenum;
    QUIT;

    PROC TABULATE DATA=test2;
        TITLE 'TEST 2';
        CLASS county;
        VAR numfatal;
        TABLE county, SUM*(numfatal);
        WHERE alcdriver GT 0;
    RUN;
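A likely explanation for A=10 and B=11, offered as a sketch rather than a definitive diagnosis: county is selected but is neither aggregated nor listed in GROUP BY, so PROC SQL remerges the summary values back onto every original row. test2 then still has one row per person, and each crash's fatalities are summed once per person: 5 x 2 = 10 for county A, and 3 x 3 + 2 x 1 = 11 for county B. Counting rows per case in test2 is a quick way to check this; n_rows is a name invented here.

```sas
/* Assumes WORK.TEST2 was created by the PROC SQL step above. */
PROC SQL;
    SELECT casenum,
           COUNT(*) AS n_rows   /* hypothetical name: rows kept per crash */
    FROM test2
    GROUP BY casenum;
QUIT;
```

Any casenum with more than one row points to the remerge. Adding county to the GROUP BY clause, or keeping only the first row per casenum afterwards, should bring the table down to one row per crash.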

halkyos · ‎10-16-2019

I am working with some 2015-2017 FARS data to analyze traffic deaths by county. One of the subjects of interest is fatalities which occurred in a crash where at least one driver had a BAC above the legal limit. My code works well when an accident has only one driver over the limit, but when there is more than one, the fatalities start getting counted multiple times. The following example sets up some fake data and runs the actual code I am using in PROC TABULATE (aside from the dataset name and the title):

    DATA test;
        input casenum persnum numfatal county $ driver alcdriver;
        datalines;
    1 1 2 A 1 1
    1 2 2 A 1 0
    1 3 2 A 1 1
    1 4 2 A 0 0
    1 5 2 A 0 0
    2 1 3 B 1 1
    2 2 3 B 0 0
    2 3 3 B 0 0
    3 1 1 B 1 1
    3 2 1 B 1 1
    4 1 1 B 1 0
    ;
    RUN;

    PROC TABULATE DATA=test;
        TITLE 'TEST';
        CLASS county;
        VAR numfatal;
        TABLE county, SUM*(numfatal);
        WHERE alcdriver=1;
    RUN;

The variables have the same names I am using in the actual data, and they represent the following:

- casenum: A sequential ordering of crashes. This is the same for each person involved in the crash.
- persnum: A sequential ordering of each person per crash. This begins at 1 and resets for each crash.
- numfatal: The number of fatalities in the crash. This is the same for each person involved in the crash.
- county: The county where the crash occurred.
- driver: Whether the person was a driver (1) or not (0).
- alcdriver: Whether the person was a driver AND over the legal limit (1) or not (0).

So looking at the data (not the PROC TABULATE results) we see that county A had one crash (casenum 1) involving five people, two of whom were drunk drivers, resulting in 2 fatalities. County B had 3 crashes: casenum 2 involved three people with only one drunk driver and three deaths; casenum 3 involved two people, both drunk drivers, and one death; casenum 4 had a death, but the driver wasn't drunk.

What I should be seeing as a result of summarizing the number of fatalities by county where alcdriver=1 is County A: 2 and County B: 4.

What I am seeing is: (output image not included)

So it appears that the deaths are getting counted for each drunk driver instead of only once per case. How do I fix this?

-------------------------------------------------------------------------

Here is my log:

    1
    2    DATA test;
    3    input casenum persnum numfatal county $ driver alcdriver;
    4    datalines;

    NOTE: The data set WORK.TEST has 11 observations and 6 variables.
    NOTE: DATA statement used (Total process time):
          real time           0.02 seconds
          cpu time            0.03 seconds

    16   ;
    17   RUN;
    18   PROC TABULATE DATA=test;
    NOTE: Writing HTML Body file: sashtml.htm
    19   TITLE 'TEST';
    20   CLASS county;
    21   VAR numfatal;
    22   TABLE county, SUM*(numfatal);
    23   WHERE alcdriver=1;
    24   RUN;

    NOTE: There were 5 observations read from the data set WORK.TEST.
          WHERE alcdriver=1;
    NOTE: PROCEDURE TABULATE used (Total process time):
          real time           0.50 seconds
          cpu time            0.40 seconds
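To see where the extra fatalities come from, it may help to list the person-level rows that the WHERE clause keeps. This is only an illustrative check against the fake data, not part of the original code.

```sas
/* Assumes WORK.TEST exists as created by the DATA step above. */
PROC PRINT DATA=test NOOBS;
    TITLE 'Rows selected by WHERE alcdriver=1';
    WHERE alcdriver=1;
    VAR casenum persnum county numfatal;
RUN;
```

The five rows read (matching the count in the log) are two for casenum 1, one for casenum 2, and two for casenum 3. Because numfatal repeats on every row of a crash, the sums become 2 + 2 = 4 for county A and 3 + 1 + 1 = 5 for county B, rather than the intended 2 and 4.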

halkyos · ‎10-01-2019

I am not seeing any clusters or outliers (this is what I was getting at with the distribution earlier). Taking a closer look with PROC SGPLOT, there are no outliers in the overall data. However, when the data is broken down into sub1=0 and sub1=1, there are a few outliers in risk factors (rfs>15). There are no outliers in protective factors for either group.

halkyos · ‎10-01-2019

Alright, adding the interaction term only, we get:

    sub1=-2.8998+0.2695(rfs)+0.0755(pfs)+0.00576(rfs*pfs)

I tried putting rfs and pfs as classes and it brought up coefficients for each level of risk and protective factors (and with rfs*pfs it then gave a coefficient for each level of that). It was very messy.

With 18+ as a final category we get:

    sub1=-3.1942+0.3049(rfs)+0.1179(pfs)

OR (with effects)

    sub1=-2.9301+0.2738(rfs)+0.0787(pfs)+0.00529(rfs*pfs)

One idea that I am considering: could this be an effect of having the data by observation? Should I instead set it up so there is only one row for each combination of substance use = a (0 or 1), risk factors = b (0-21), and protective factors = c (0-12), along with the frequency of responses where those are true? For example, if there are 12 people who said they use the substance and who also reported 3 risk factors and 5 protective factors, the row would read (sub) 1, (rfs) 3, (pfs) 5, (n) 12. Right now each of those 12 people is a separate observation in the data set.
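The one-row-per-combination layout described above can be fitted with the FREQ statement in PROC LOGISTIC. The sketch below assumes the survey data set and variable names from the earlier posts; survey_agg and n are names made up for illustration.

```sas
/* Collapse the person-level data to one row per (sub1, rfs, pfs) combination. */
PROC SQL;
    CREATE TABLE survey_agg AS
    SELECT sub1,
           rfs,
           pfs,
           COUNT(*) AS n        /* hypothetical name: respondents in this cell */
    FROM survey
    GROUP BY sub1, rfs, pfs;
QUIT;

/* FREQ treats each row as if it appeared n times in the input. */
PROC LOGISTIC DATA=survey_agg DESCENDING;
    FREQ n;
    MODEL sub1 = rfs pfs;
RUN;
```

Because the likelihood is the same either way, the aggregated fit would be expected to reproduce the per-observation estimates, so the data layout is unlikely to be what makes the pfs coefficient positive.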

halkyos · ‎10-01-2019

Sorry I saw the request and was having trouble getting it all to show up (13x22 table problems). Hopefully this helps. I am not familiar with adding interaction terms. Is this just running the model as sub1=rfs pfs rfs*pfs? (edit: I cropped the images to allow them to be easier to read)
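For reference, a minimal sketch of that interaction model, assuming the same survey data set and variables as in the earlier PROC LOGISTIC call:

```sas
/* rfs*pfs adds the product of the two counts as a third predictor. */
PROC LOGISTIC DATA=survey DESCENDING;
    MODEL sub1 = rfs pfs rfs*pfs;
RUN;
```

Since rfs and pfs are not listed in a CLASS statement, they are treated as continuous and the interaction contributes a single coefficient.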

halkyos · ‎10-01-2019

Some additional exploration that may help us figure this out. I decided to run four separate models:

- Substance use predicted by risk factors only
- Substance use predicted by protective factors only
- No substance use predicted by risk factors only
- No substance use predicted by protective factors only

The results are:

    Substance use 'yes'=-2.1091+0.2623(rfs)
    Substance use 'yes'=0.5846-0.0937(pfs)
    Substance use 'no'=2.1091-0.2623(rfs)
    Substance use 'no'=-0.5846+0.0937(pfs)

So the second half of those were probably unnecessary, since they are just the inverse of their counterparts in the first two models. All p-values < 0.0001. The standard error on either pfs model is 0.00542 and on either rfs model is 0.00478.

This is to show that, when tested independently, these are behaving as expected: an increase in the number of risk factors increases the probability of using substances, while an increase in protective factors reduces this probability. The problem arises when they are thrown into a model together.
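These equations are on the log-odds scale, which is what PROC LOGISTIC fits for a binary outcome by default. As a small sketch of how the risk-factor-only model translates into probabilities, the DATA step below applies the inverse logit to the reported coefficients; pred, logit, and p_use are names invented here.

```sas
/* Predicted probability of substance use across the observed range of rfs,
   using the coefficients reported for the risk-factor-only model. */
DATA pred;
    DO rfs = 0 TO 21;
        logit = -2.1091 + 0.2623*rfs;     /* linear predictor (log-odds) */
        p_use = 1 / (1 + EXP(-logit));    /* inverse logit */
        OUTPUT;
    END;
RUN;

PROC PRINT DATA=pred NOOBS;
RUN;
```

The same step with 0.5846 - 0.0937*pfs over pfs = 0 to 12 gives the protective-factor curve. Swapping the modeled event from 'yes' to 'no' only flips the sign of every coefficient, which is why the last two models mirror the first two.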

halkyos · ‎10-01-2019

My PROC CORR results are as follows: There are no high or low outliers for either variable.

halkyos · ‎10-01-2019

The standard errors on the coefficients in the model are 0.00563 (rfs) and 0.00743 (pfs).

halkyos · ‎10-01-2019

PROC CORR is new to me, but looking at the guide on that one it seems pretty straightforward. I ran:

    PROC CORR DATA=survey;
        VAR rfs pfs;
    RUN;

My results are:

halkyos · ‎10-01-2019

The standard errors are as follows: rfs: 0.0417 pfs: 0.0261

halkyos · ‎09-30-2019

So I tried this before coming on here. What it does is switch which of the two has the larger positive coefficient, but both remain positive. My office is renewing my license today so I can't currently give you the exact coefficients, but what happens is it becomes sub1=y+pfs+rfs where the coefficient of pfs > the coefficient of rfs, and 0 <= either coefficient <= 1.

halkyos · ‎09-27-2019

I just tested for these possibilities:

A chi-square test for independence indicates that rfs and pfs are independent.

When entered into PROC AUTOREG for rfs=pfs and pfs=rfs, the values are negatively correlated with each other:

    chi-sq: 4876.6114, p<0.0001
    rfs=12.9714-0.7614(pfs), p<0.0001
    pfs=8.7403-0.2989(rfs), p<0.0001

The overall distributions are almost textbook normal. When stratified by whether or not the observation reported substance use, the distribution of rfs for non-substance users takes on a right-tail skew. All other distributions remain normal.

halkyos · ‎09-27-2019

Here is my log:

    306  PROC LOGISTIC DATA=survey DESCENDING;
    307  MODEL sub1=rfs pfs;
    308  RUN;

    NOTE: PROC LOGISTIC is modeling the probability that sub1=1.
    NOTE: Convergence criterion (GCONV=1E-8) satisfied.
    NOTE: There were 14445 observations read from the data set WORK.SURVEY.
    NOTE: PROCEDURE LOGISTIC used (Total process time):
          real time           0.25 seconds
          cpu time            0.15 seconds

rfs and pfs are non-negative integers representing the number of risk or protective factors present. rfs ranges from 0-21 and pfs ranges from 0-12.

halkyos · ‎09-27-2019

I am working with risk and protective factor data for outcomes regarding substance use. My data is arranged so that I have the outcome as a binary variable (0 = no use, 1 = use), the total number of risk factors, and the total number of protective factors. Risk factors are known to increase the likelihood of an outcome occurring and protective factors are known to have the opposite effect. Examination in PROC FREQ shows that the proportion of observations using a substance increases with the number of risk factors and decreases with the number of protective factors. When I use PROC LOGISTIC to write a model, though, I am getting a positive effect from my protective factors.

Here is my code:

    PROC LOGISTIC DATA=survey DESCENDING;
        MODEL sub1= rfs pfs;
    RUN;

- sub1: binary variable where 1 = using the substance and 0 = not using the substance.
- rfs: total number of risk factors.
- pfs: total number of protective factors.

My results for one substance are giving me a model of p(1)=-3.1860+0.3033(rfs)+0.1181(pfs). As a researcher I know that this is wrong: I don't have anomalous data where the population is more likely to use substances if they have more protective factors. But I am having trouble figuring out how to correct this.

Avoid observations being counted twice in PROC TABULATE

PROC LOGISTIC: Positive effect in logistic model where a negative one ...