BookmarkSubscribeRSS Feed
yoyong555
Obsidian | Level 7

 

Hi. I have this data set with duplicates.

 

I want to add a Group column where if an ID with duplicates received  all the same scores, then they are labelled as 1, otherwise,  labelled as 2. If an ID has no duplicate, then it is labelled as 3.

 

data have ;                                                                

  input ID $ score ;                                                        

  cards ;                                                                  

1     10                                                                    

1     10                                                                     

2     5                                                                    

3     20                                                                    

3     21
3     20                                                                    

4     15                                                                     

4     18
5     . 
6     8
6     8
7     17
7     17
7     25                                                                   

;                                                                           

run ;

 

The result would look like this.

 

ID   Score   Group
1     10         1                                                           

1     10         1                                                           

2     5           3                                                            

3     20         2                                                           

3     21         2  
3     20         2                                                           

4     15         2                                                            

4     18         2
5      .           3
6     8           1
6     8           1
7     17         2
7     17         2
7     25         2

 

Then I want to create a table summarizing the results counting the IDs (only once if duplicates) that had the same scores (Group 1), different scores (Group 2) and those without duplicates (Group 3).

 

The table would look something like this.

 

Group         Frequency

1                     2

2                     3

3                     2

 

Thank you for your help!

 

4 REPLIES 4
novinosrin
Tourmaline | Level 20
proc sql;
create table want as
select *, case when count(distinct score) in (1,0) and count(score) in (1,0) then 3
          when count(distinct score)=1 and count(score)>1 then 1
		  when count(distinct score)>1 then 2
		  end as group
from have
group by id
order by id;
quit;
yoyong555
Obsidian | Level 7

Thank you @novinosrin .

I got this comment.

 

NOTE: A CASE expression has no ELSE clause. Cases not accounted for by the
WHEN clauses will result in a missing value for the CASE expression.
NOTE: The query requires remerging summary statistics back with the original
data.
NOTE: Table WORK.WANT created, with 14 rows and 3 columns.

194 order by id;
-----
180
ERROR 180-322: Statement is not valid or it is used out of proper order.

 

novinosrin
Tourmaline | Level 20

Hello @yoyong555  ok added the else, however i did not get any error. here is my test

 


data have ;                                                                

  input ID $ score ;                                                        

  cards ;                                                                  
1     10                                                                    
1     10                                                                     
2     5                                                                    
3     20                                                                    
3     21
3     20                                                                    
4     15                                                                     
4     18
5     . 
6     8
6     8
7     17
7     17
7     25                                                                   
;                                                                           

run ;

 
proc sql;
create table want as
select *, case when count(distinct score) in (1,0) and count(score) in (1,0) then 3
          when count(distinct score)=1 and count(score)>1 then 1
		  when count(distinct score)>1 then 2
		  else .  end as group
from have
group by id
order by id;
quit;
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: Table WORK.WANT created, with 14 rows and 3 columns.

46   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.02 seconds
      cpu time            0.03 seconds

 

 

hashman
Ammonite | Level 13

@yoyong555:

Since your input is sorted by ID, both output data sets can be generated in a single step with 2 passes through the input data:

data have ;                                                                                                                             
  input ID $ score ;                                                                                                                    
  cards ;                                                                                                                               
1  10                                                                                                                                   
1  10                                                                                                                                   
2   5                                                                                                                                   
3  20                                                                                                                                   
3  21                                                                                                                                   
3  20                                                                                                                                   
4  15                                                                                                                                   
4  18                                                                                                                                   
5   .                                                                                                                                   
6   8                                                                                                                                   
6   8                                                                                                                                   
7  17                                                                                                                                   
7  17                                                                                                                                   
7  25                                                                                                                                   
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data group (keep = id score group)                                                                                                      
     freq  (keep = group freq)                                                                                                          
  ;                                                                                                                                     
  do _n_ = 1 by 1 until (last.id) ;                                                                                                     
    set have ;                                                                                                                          
    by id ;                                                                                                                             
    if _n_ = 1 then _score1 = score ;                                                                                                   
    else if _score1 ne score then group = 2 ;                                                                                           
  end ;                                                                                                                                 
                                                                                                                                        
  if _n_ = 1 then group = 3 ;                                                                                                           
  else if group ne 2 then group = 1 ;                                                                                                   
                                                                                                                                        
  array fq [3] _temporary_ ;                                                                                                            
  fq [group] + 1 ;                                                                                                                      
                                                                                                                                        
  do _n_ = 1 to _n_ ;                                                                                                                   
    set have end = lr ;                                                                                                                 
    output group ;                                                                                                                      
  end ;                                                                                                                                 
                                                                                                                                        
  if lr then do group = 1 to dim (fq) ;                                                                                                 
    freq = fq [group] ;                                                                                                                 
    output freq ;                                                                                                                       
  end ;                                                                                                                                 
run ;                                       

Each BY group is read twice:

  1. On the first pass: SCORE in the second and subsequent records is compared with SCORE in the first record. If there's a difference, GROUP=2.
  2. After the first pass (between the two DoW-loops): (A) the program looks at the number of records _N_ from the current BY group. If _N_=1, GROUP=3; otherwise if GROUP is not 2, all scores in the BY group are the same, so GROUP=1. (B) Now that the GROUP value for the current BY group has been found, 1 is added to the array item whose index is equal to GROUP.
  3. On the second pass (the second DoW-loop): Every record from the current BY group is written to file GROUP with the value of GROUP added.
  4. After the second pass (after the second DoW-loop): If the last record from the second input stream (related to the second SET) has just been read, LR (short for "last record") is auto-set to 1 from its initial automatic value of 0. In this case, the frequencies are gathered from the array and written to file FREQ.
  5. Program control goes back to the top of the step, enters the first DoW-loop, and executed the SET statement. Since all the records from the first input stream (related to the first SET) have been read, its input buffer is empty. The attempt to read from an empty buffer causes the DATA step to terminate.

Simple!

 

Kind regards

Paul D. 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 381 views
  • 0 likes
  • 3 in conversation