handling duplicates

yoyong555 · Posted 09-05-2019 05:20 PM

Hi. I have this data set with duplicates.

I want to add a Group column where if an ID with duplicates received all the same scores, then they are labelled as 1, otherwise, labelled as 2. If an ID has no duplicate, then it is labelled as 3.

data have ;

input ID $ score ;

cards ;

1 10

2 5

3 20

3 21
3 20

4 15

4     18
5     .
6     8
6     8
7     17
7     17
7     25

;

run ;

The result would look like this.

ID Score Group
1 10 1

1 10 1

2 5 3

3 20 2

3 21 2
3 20 2

4 15 2

4     18         2
5      . 3
6     8 1
6     8 1
7     17         2
7     17         2
7     25         2

Then I want to create a table summarizing the results counting the IDs (only once if duplicates) that had the same scores (Group 1), different scores (Group 2) and those without duplicates (Group 3).

The table would look something like this.

Group Frequency

1 2

2 3

3 2

Thank you for your help!

novinosrin · Posted 09-05-2019 05:54 PM

proc sql;
create table want as
select *, case when count(distinct score) in (1,0) and count(score) in (1,0) then 3
          when count(distinct score)=1 and count(score)>1 then 1
		  when count(distinct score)>1 then 2
		  end as group
from have
group by id
order by id;
quit;

yoyong555 · Posted 09-05-2019 08:16 PM

Thank you @novinosrin .

I got this comment.

NOTE: A CASE expression has no ELSE clause. Cases not accounted for by the
WHEN clauses will result in a missing value for the CASE expression.
NOTE: The query requires remerging summary statistics back with the original
data.
NOTE: Table WORK.WANT created, with 14 rows and 3 columns.

194 order by id;
-----
180
ERROR 180-322: Statement is not valid or it is used out of proper order.

novinosrin · Posted 09-05-2019 08:31 PM

Hello @yoyong555 ok added the else, however i did not get any error. here is my test


data have ;                                                                

  input ID $ score ;                                                        

  cards ;                                                                  
1     10                                                                    
1     10                                                                     
2     5                                                                    
3     20                                                                    
3     21
3     20                                                                    
4     15                                                                     
4     18
5     . 
6     8
6     8
7     17
7     17
7     25                                                                   
;                                                                           

run ;

 
proc sql;
create table want as
select *, case when count(distinct score) in (1,0) and count(score) in (1,0) then 3
          when count(distinct score)=1 and count(score)>1 then 1
		  when count(distinct score)>1 then 2
		  else .  end as group
from have
group by id
order by id;
quit;

NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Table WORK.WANT created, with 14 rows and 3 columns.

46   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.02 seconds
      cpu time            0.03 seconds

hashman · Posted 09-07-2019 05:26 PM

@yoyong555:

Since your input is sorted by ID, both output data sets can be generated in a single step with 2 passes through the input data:

data have ;                                                                                                                             
  input ID $ score ;                                                                                                                    
  cards ;                                                                                                                               
1  10                                                                                                                                   
1  10                                                                                                                                   
2   5                                                                                                                                   
3  20                                                                                                                                   
3  21                                                                                                                                   
3  20                                                                                                                                   
4  15                                                                                                                                   
4  18                                                                                                                                   
5   .                                                                                                                                   
6   8                                                                                                                                   
6   8                                                                                                                                   
7  17                                                                                                                                   
7  17                                                                                                                                   
7  25                                                                                                                                   
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data group (keep = id score group)                                                                                                      
     freq  (keep = group freq)                                                                                                          
  ;                                                                                                                                     
  do _n_ = 1 by 1 until (last.id) ;                                                                                                     
    set have ;                                                                                                                          
    by id ;                                                                                                                             
    if _n_ = 1 then _score1 = score ;                                                                                                   
    else if _score1 ne score then group = 2 ;                                                                                           
  end ;                                                                                                                                 
                                                                                                                                        
  if _n_ = 1 then group = 3 ;                                                                                                           
  else if group ne 2 then group = 1 ;                                                                                                   
                                                                                                                                        
  array fq [3] _temporary_ ;                                                                                                            
  fq [group] + 1 ;                                                                                                                      
                                                                                                                                        
  do _n_ = 1 to _n_ ;                                                                                                                   
    set have end = lr ;                                                                                                                 
    output group ;                                                                                                                      
  end ;                                                                                                                                 
                                                                                                                                        
  if lr then do group = 1 to dim (fq) ;                                                                                                 
    freq = fq [group] ;                                                                                                                 
    output freq ;                                                                                                                       
  end ;                                                                                                                                 
run ;

Each BY group is read twice:

On the first pass: SCORE in the second and subsequent records is compared with SCORE in the first record. If there's a difference, GROUP=2.
After the first pass (between the two DoW-loops): (A) the program looks at the number of records _N_ from the current BY group. If _N_=1, GROUP=3; otherwise if GROUP is not 2, all scores in the BY group are the same, so GROUP=1. (B) Now that the GROUP value for the current BY group has been found, 1 is added to the array item whose index is equal to GROUP.
On the second pass (the second DoW-loop): Every record from the current BY group is written to file GROUP with the value of GROUP added.
After the second pass (after the second DoW-loop): If the last record from the second input stream (related to the second SET) has just been read, LR (short for "last record") is auto-set to 1 from its initial automatic value of 0. In this case, the frequencies are gathered from the array and written to file FREQ.
Program control goes back to the top of the step, enters the first DoW-loop, and executed the SET statement. Since all the records from the first input stream (related to the first SET) have been read, its input buffer is empty. The attempt to read from an empty buffer causes the DATA step to terminate.

Simple!

Kind regards

Paul D.

handling duplicates

Re: handling duplicates

Re: handling duplicates

Re: handling duplicates

Re: handling duplicates

handling duplicates

Re: handling duplicates

Re: handling duplicates

Re: handling duplicates

Re: handling duplicates

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away