BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.

The data is basically a list of testing results from thousands of users. Each user can take the test multiple times. The first part of my analysis is to generate the number of people that failed on their first attempt (pretty straightforward...I created a counter variable to enumerate each try and used a data step to limit observations to users who tried at least twice...from there I identified any first time failers where count=1 and pass=0). The other part of the analysis is to generate the number of people (from the original group of people that failed on their first attempt) that eventually passed after failing the first time.

I am having trouble with the code when it comes to sub-setting just the observations that failed on their first try in order to find out how many eventually passed.

My idea is to remove all observations that passed on the first try and all subsequent attempts after the first try (assuming they passed on the first try). Once I have removed those observations, the idea is to enumerate the total number of passes (in the subset of people that failed on their first try) through more if statements and de-duplicating any multiple passes (from multiple attempts from a single user) in a sort statement to find the unique number of passes.

Once I have the unique number of passes (from the group that failed on their first attempt), I think I can subtract that from the number of people that failed on their first attempt in order to find the number of people that never passed.

Example data set ("have"):

Email    Tries    Score   Pass     Count

    1         4         78        0            1

    1         4         100      1            2

    1         4         56        0            3

    1         4         100      1            4

    2         2          67       0            1

    2         2          89       1            2

    3         2          78       0            1

    3         2          89       1            2

    4         4          80       1            1

    4         4          78       0            2

    4         4          81       1            3

    4         4          79       0            4

Code that I attempted:

*creating a counter variable that enumerates each attempt;

proc sort data=have;

by email;

run;

data have1;

set have;

count +1;

by email;

if first.email then count=1;

keep tries pass email count score;

run;

data want;

set have1;

where tries ge 2; *selecting all students who tried at least twice;

if count=1 and pass=0 then firstfail=1; *creating a new variable for anyone that failed on the first try;

else firstfail=0;

if count=1 and pass=1 then delete; *removing observations that passed on the first try; *I know this part is wrong because I need to remove all subsequent observations following the observation that had a pass on the first try;

if 2=<count<=40 and pass=1 then passlater=1; *Again, I know this is bad code because this potentially still includes observations that passed on the first try...I would like to understand how to remove all of the associated counts (i.e. count 2,3,4,....) after the first observation (count=1) that I deleted;

else passlater=0;

run;

proc freq data=want; *this proc will provide the denominator for everyone that failed on their first attempt;

tables firstfail/list;

run;

data want1;

set want;

where passlater=1;

run;

proc sort data=passlater nodupkey;

by studentemail;

run;

proc freq data=passlater;

tables passlater/list;

run;

Thanks for the assistance!

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

I don't understant your full requirement but this bit of code might give you a start:

data have;

input Email    Tries    Score   Pass;

datalines;

    1         4          78       0

    1         4         100       0 

    1         4          56       0

    1         4         100       1 

    2         2          67       0

    2         2          89       0

    3         2          78       0

    3         2          89       1

    4         4          80       1

    4         4          78       0

    4         4          81       1

    4         4          79       0

;

proc sort data=have; by email; run;

/* Add sequence numbers */

data havec;

set have;

by email;

if first.email then count=0;

count+1;

run;

/* Identify Email which fail on count=1 then pass on count>1,

keep count of first pass=1 after fail(s) */

proc sql;

create table failFirstThenPass as

select email, min(count) as passCount

from havec

where email in (select email from havec where count=1 and pass=0)

     and pass=1

group by email;


select * from failFirstThenPass;

quit;

PG

PG

View solution in original post

5 REPLIES 5
art297
Opal | Level 21

Given your data set "have" what should the resulting file look like (i.e., show the desired result)?

ImNoStatistician
Calcite | Level 5

Hi Art,

If the deletions are possible, I would like to see email addresses with a passing first attempt (i.e. email=4 and count=1) removed; however, I would like count=2, count=3, and count=4 removed as well because I am only concerned about email addresses that failed on the first attempt and passed later on (i.e. email=1). To clarify, I would like to only see observations that failed on the first attempt AND passed later on in my final data set. Presumably, this number will be less than or equal to the total number that failed on their first try (and tried at least twice) so I will be able to calculate those that never passed by subtracting the resultant number of observations (those that failed on their first attempt and passed later on) from the total number that failed on their first try after trying at least twice.

Below is the resultant (final) dataset I am hoping to obtain (all "email=4" rows have been removed since email=4 passed on the first attempt ). I would like to see a single indicator for Passlater (1=yes) calculated based on any pass after count 1. So for example, Email=1 has two passes after the fail on count 1 (a pass on count 2 and a pass on count 4) but I would like that to count as a "single" pass since they eventually passed on a later attempt (regardless of which count the pass occurred on).

Email    Tries    Score   Pass     Count     Failfirst     Passlater

    1         4         78        0            1            1                1

    1         4         100      1            2

    1         4         56        0            3

    1         4         100      1            4

    2         2          67       0            1            1                1

    2         2          89       1            2

    3         2          78       0            1            1                1

    3         2          89       1            2

   4         4          80       1            1

    4         4          78       0            2

   4         4          81       1            3

   4         4          79       0            4

  

Thanks!

art297
Opal | Level 21

If that is all you need then how about?:

data want (drop=keepme);

  set have;

  by email;

  retain keepme;

  if first.email then do;

    if pass eq 1 and count eq 1 then keepme=0;

    else keepme=1;

  end;

  if keepme then output;

run;

PGStats
Opal | Level 21

I don't understant your full requirement but this bit of code might give you a start:

data have;

input Email    Tries    Score   Pass;

datalines;

    1         4          78       0

    1         4         100       0 

    1         4          56       0

    1         4         100       1 

    2         2          67       0

    2         2          89       0

    3         2          78       0

    3         2          89       1

    4         4          80       1

    4         4          78       0

    4         4          81       1

    4         4          79       0

;

proc sort data=have; by email; run;

/* Add sequence numbers */

data havec;

set have;

by email;

if first.email then count=0;

count+1;

run;

/* Identify Email which fail on count=1 then pass on count>1,

keep count of first pass=1 after fail(s) */

proc sql;

create table failFirstThenPass as

select email, min(count) as passCount

from havec

where email in (select email from havec where count=1 and pass=0)

     and pass=1

group by email;


select * from failFirstThenPass;

quit;

PG

PG
ImNoStatistician
Calcite | Level 5

This worked! SQL is magic! Thanks PG + Art.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 1480 views
  • 3 likes
  • 3 in conversation