BookmarkSubscribeRSS Feed
Walternate
Obsidian | Level 7

Hi,

 

I have a dataset in which each row is a person's score for a given event on a given date. People have 5 events in each day, as well as a total score which incorporates the 5 event scores but is not a straight sum. People can be in the dataset on multiple days. 

 

ID      Date        Event      Score

1    10/1/2012     1                5

1    10/1/2012     2                3

1    10/1/2012     3                7

1    10/1/2012     4                2

1    10/1/2012     5                3

1    10/1/2012    tot              17

1    10/1/2012    tot              14

2    9/7/2014       1                3

2    9/7/2014       1             null

2    9/7/2014       2               7

2    9/7/2014       2             null

 

I'm trying to clean the data but it has 3 problems:

  1. One person will have multiple sets of events on one day, in which one set has values and one set is null/missing

  2. One person will have multiple sets of events on one day, in which both sets have identical values

  3. One person will have multiple sets of events in one day, in which both sets have different values

 

This is my solution to 1 and 2:

 

Proc sort data=have;

By ID date event descending score; Run;

 

 

Data want;

Set have;

By ID date event descending score;

If first.event output want;

Run;

 

That should output the first of two identical rows or a non-missing row if one set of responses has values and the other set is missing. However, it does not address the third problem. I know I can't get rid of a row at random so if I could maybe create a flag or some sort of indicator that would show that the person had two distinct scores for that event, that would be helpful.

 

Additionally, when I was going through the data, the ones that I noticed people having multiple scores on the same day for were the total scores. If event=tot were the only ones for which a person had two distinct values of score, I could just delete the event=tot rows altogether, transpose the data, then calculate the total scores myself. However, I would need to be able to verify that it was only the event=tot rows that had two distinct responses, and I'm not sure how to do that.

 

Any help is much appreciated.

2 REPLIES 2
LinusH
Tourmaline | Level 20

Are you saying that the stored total that is not a straight sum is wrong, and your calculation would be better?

First, why store totals, if they are, totals. That's the job for the reporting procedure.

 

Second, go back to your source to get better data, or at least more details, Maybe ther's a timestamp for each record, and you could use that for a cleansing rule.

 

Other than that, there is no silver bullet - you need to create a rule, and communicate that very clear to any person looking at this data/report. If you chose the first/last/medium, or just set it to missing/null - pretty much up to business requirements.

Data never sleeps
hbi
Quartz | Level 8 hbi
Quartz | Level 8

This should give you what you need ...

 

DATA have;
  input ID Date :MMDDYY10. Event $ Score;
  format date DATE9.;
  datalines;
  1 10/1/2012 1    5
  1 10/1/2012 2    3
  1 10/1/2012 3    7
  1 10/1/2012 4    2
  1 10/1/2012 5    3
  1 10/1/2012 tot 17
  1 10/1/2012 tot 14
  2 9/7/2014  1    3
  2 9/7/2014  1    .
  2 9/7/2014  2    7
  2 9/7/2014  2    .
;
RUN;


PROC SQL;
  CREATE TABLE want AS 
  SELECT ID
       , Date
       , Event
       , MAX(Score)            AS Max_Score
         /* other helper variables are listed below ... */
       , NMISS(Score)          AS Num_Missing_Score
       , COUNT(DISTINCT Score) AS Num_Distinct_Score
  FROM have
  GROUP BY ID, Date, Event;
QUIT;

 

Here is the output: 

 

wanted_distinct.gif

 

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 739 views
  • 0 likes
  • 3 in conversation