10-15-2015 10:18 AM
I have a dataset in which each row is a person's score for a given event on a given date. People have 5 events in each day, as well as a total score which incorporates the 5 event scores but is not a straight sum. People can be in the dataset on multiple days.
ID Date Event Score
1 10/1/2012 1 5
1 10/1/2012 2 3
1 10/1/2012 3 7
1 10/1/2012 4 2
1 10/1/2012 5 3
1 10/1/2012 tot 17
1 10/1/2012 tot 14
2 9/7/2014 1 3
2 9/7/2014 1 null
2 9/7/2014 2 7
2 9/7/2014 2 null
I'm trying to clean the data but it has 3 problems:
1. One person will have multiple sets of events on one day, in which one set has values and one set is null/missing
2. One person will have multiple sets of events on one day, in which both sets have identical values
3. One person will have multiple sets of events in one day, in which both sets have different values
This is my solution to 1 and 2:
Proc sort data=have;
By ID date event descending score; Run;
By ID date event descending score;
If first.event output want;
That should output the first of two identical rows or a non-missing row if one set of responses has values and the other set is missing. However, it does not address the third problem. I know I can't get rid of a row at random so if I could maybe create a flag or some sort of indicator that would show that the person had two distinct scores for that event, that would be helpful.
Additionally, when I was going through the data, the ones that I noticed people having multiple scores on the same day for were the total scores. If event=tot were the only ones for which a person had two distinct values of score, I could just delete the event=tot rows altogether, transpose the data, then calculate the total scores myself. However, I would need to be able to verify that it was only the event=tot rows that had two distinct responses, and I'm not sure how to do that.
Any help is much appreciated.
10-15-2015 11:06 AM
Are you saying that the stored total that is not a straight sum is wrong, and your calculation would be better?
First, why store totals, if they are, totals. That's the job for the reporting procedure.
Second, go back to your source to get better data, or at least more details, Maybe ther's a timestamp for each record, and you could use that for a cleansing rule.
Other than that, there is no silver bullet - you need to create a rule, and communicate that very clear to any person looking at this data/report. If you chose the first/last/medium, or just set it to missing/null - pretty much up to business requirements.
10-15-2015 08:35 PM
This should give you what you need ...
DATA have; input ID Date :MMDDYY10. Event $ Score; format date DATE9.; datalines; 1 10/1/2012 1 5 1 10/1/2012 2 3 1 10/1/2012 3 7 1 10/1/2012 4 2 1 10/1/2012 5 3 1 10/1/2012 tot 17 1 10/1/2012 tot 14 2 9/7/2014 1 3 2 9/7/2014 1 . 2 9/7/2014 2 7 2 9/7/2014 2 . ; RUN; PROC SQL; CREATE TABLE want AS SELECT ID , Date , Event , MAX(Score) AS Max_Score /* other helper variables are listed below ... */ , NMISS(Score) AS Num_Missing_Score , COUNT(DISTINCT Score) AS Num_Distinct_Score FROM have GROUP BY ID, Date, Event; QUIT;
Here is the output: