xiangpang
Quartz | Level 8

Hello,

 

I want to add some rows to my dataset 'eq'. For each ID, y should take all of the values (. 0 1 2). If one of them is missing in eq, add that row back in 'want' with x set to 0.

Actually, I have many IDs. Could anyone tell me how to do it? 

 

Thanks

data eq;
input ID y x ;
cards;
1 1 27 
1 0 . 
1 . 30 
1 2 38 
2 . 23 
2 0 32  
2 2 . 
3 0 33 
3 1 21 
3 2 13 
4 1 56 
4 0 67 
;
run;

want:

ID y x
1 1 27
1 0 .
1 . 30
1 2 38
2 . 23
2 0 32
2 1 0
2 2 .
3 0 33
3 1 21
3 . 0
3 2 13
4 1 56
4 0 67
4 . 0
4 2 0

 


mjabed600
Fluorite | Level 6

I'm having a very difficult time trying to understand what your question really is... Can you elaborate a little more, or maybe rephrase the question?

xiangpang
Quartz | Level 8

Sorry about that.

The original data has some missing rows. For each ID, there should be a row for each of the 4 different values of y (., 0, 1, or 2), i.e. 4 rows per ID, so I need to add the missing rows back. When a missing row is added back, the value of x should be 0. I hope that explains it more clearly.

ChrisNZ
Tourmaline | Level 20

Another way:

proc sql; 
  create table WANT as
    select  a.ID
          , b.Y
          , coalesce(c.X, 0) as X
    from (select unique ID from EQ)  a
        full outer join
         (select unique Y  from EQ)  b
         on 1
        left join
        EQ                           c
        on  a.ID = c.ID
        and b.Y  = c.Y
    order by 1,2;
quit;
ID Y X
1 . 30
1 0 0
1 1 27
1 2 38
2 . 23
2 0 32
2 1 0
2 2 0
3 . 0
3 0 33
3 1 21
3 2 13
4 . 0
4 0 67
4 1 56
4 2 0

 

ChrisNZ
Tourmaline | Level 20

A variation:

proc sql; 
    select  a.ID
          , b.Y
          , sum(X, 0) as X
    from (select unique ID from EQ) a
        inner join
         (select unique Y  from EQ) b
         on 1
        left join
         EQ                          c
         on  a.ID = c.ID
         and b.Y  = c.Y
    order by 1,2;
quit;
PGStats
Opal | Level 21

@ChrisNZ Note: to request a cross product, you can replace (...) inner join (...) on 1 with (...) cross join (...)
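Applied to the variation above, that suggestion might read as follows (a quick sketch of the rewrite, not from the original post):

proc sql;
    select  a.ID
          , b.Y
          , sum(X, 0) as X
    from (select unique ID from EQ) a
        cross join
         (select unique Y  from EQ) b
        left join
         EQ                          c
         on  a.ID = c.ID
         and b.Y  = c.Y
    order by 1,2;
quit;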

PG
hashman
Ammonite | Level 13

@PGStats

 

Sure; but it's still a Cartesian product. A thought in a different direction: is it possible, in this case, to reformulate the query so that the optimizer avoids it? One could speculate that it would take advantage of the sorted input if prompted with SortedBy=ID, yet it doesn't: judging from the _method messaging, it still sorts behind the scenes. That said, even with the Cartesian product, @ChrisNZ 's query performs tolerably well against a data set with a couple of million IDs and 10 numeric variables added as ballast - about 10 seconds flat on my laptop. But since DATA step BY-processing gets there in a quarter of the time, I guess the optimizer isn't smart enough to opt for a better path (or maybe there's just no provision for one).
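For reference, that prompting might look like the sketch below. SORTEDBY= is a standard dataset option and _METHOD prints the chosen plan in the log; as noted above, neither changes the plan here:

proc sql _method;                      /* _METHOD shows the plan in the log */
  create table WANT as
    select  a.ID
          , b.Y
          , coalesce(c.X, 0) as X
    from (select unique ID from EQ)  a
        cross join
         (select unique Y  from EQ)  b
        left join
         EQ (sortedby = ID)          c /* assert EQ is already sorted by ID */
         on  a.ID = c.ID
         and b.Y  = c.Y
    order by 1,2;
quit;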

 

Paul D.

PGStats
Opal | Level 21

My note was only about SAS/SQL syntax. I wouldn't advocate using SQL for such a problem, unless perhaps the data resides in a remote DBMS, in the hope that SQL could get the server to perform the query.

 

I would be interested in a timing comparison between your proposed hash- and array-based solutions.

PG
ChrisNZ
Tourmaline | Level 20

It's a small Cartesian product (one column in each table, unique values only) and it's very legible.

Unless performance is actually an issue, I'd go for legible (and I care a lot about performance!).

A DATA step performs a single sequential read, so it will be faster, but the guy who comes after you will curse you for making him spend so much time understanding the logic rather than reading a very simple join.

hashman
Ammonite | Level 13

In general, I agree with you in terms of legibility vs insignificant differences in performance (though I care a lot about performance, too). In this case, however, it's very simple DATA step code with rather straightforward standard BY processing logic:

 

For each BY group:

- output every record as is. 

- store Y in a table (hash or otherwise)

After each BY group:

- Output a record with X=0 for each value of (. 0 1 2) not found in the table

- Clean the table (a minimal sketch of this logic follows below; the full implementations are in a later reply)
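A minimal sketch of that logic, assuming Y is limited to (. 0 1 2) as in the sample data (the array names here are illustrative):

data want;
  array ys   [4] _temporary_ (. 0 1 2);  /* the full set of Y values     */
  array seen [4] _temporary_;            /* seen[i]=1 if ys[i] occurred  */
  call missing (of seen[*]);             /* clean the table per BY group */
  do until (last.ID);
    set eq;
    by ID;
    output;                              /* output every record as is    */
    if y = . then seen[1] = 1;
    else seen[y + 2] = 1;                /* map Y=0,1,2 to cells 2,3,4   */
  end;
  x = 0;
  do _i = 1 to 4;
    if not seen[_i] then do;             /* this Y is absent for this ID */
      y = ys[_i];
      output;                            /* add it back with X=0         */
    end;
  end;
  drop _i;
run;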

 

This kind of logic is the bread and butter of any SAS programmer. If one understands the task, the logic is obvious; conversely, the nature of the task can easily be understood from this kind of code. Besides, "the guy who comes after" me would read my comments explaining both before starting to read the code. As to whether SQL is more legible than an equivalent DATA step, it depends on the nature of the task and the background of whoever is trying to comprehend it. For my part, if I were to reconstruct this particular task from code alone, I'd do it far more readily from the DATA step in question than from your query; for other people (such as those who learned SQL before procedural programming), it may be just the opposite. Suum cuique.

 

Best

Paul D.  

 

 

hashman
Ammonite | Level 13

@PGStats,

 

Sure, eager to oblige:

 

data eq (keep = ID X Y) ;
  retain r "integer" ;                  /* distribution name for RAND    */
  call streaminit (7) ;                 /* fixed seed for repeatability  */
  array yy [4] _temporary_ (. 0 1 2) ;  /* the possible Y values         */
  do ID = 1 to 5E6 ;                    /* 5 million IDs                 */
    do _n_ = 1 to rand (r, dim (yy)) ;  /* 1 to 4 records per ID         */
      Y = yy (rand (r, dim (yy))) ;     /* pick a Y value at random      */
      X = rand (r, 99) ;                /* random integer 1 to 99        */
      output ;
    end ;
  end ;
run ;
                                        
data kxarr (keep = ID x y) ;
  array _f [-1:2] ;                     /* seen-flags, key-indexed by Y   */
  array _v [-1:2] (. 0 1 2) ;           /* maps index -1:2 back to Y      */
  do until (last.id) ;
    set eq ;
    by ID ;
    output ;                            /* output every record as is      */
    if nmiss (y)       then _f[-1] = 1 ;/* missing Y maps to index -1     */
    else if y in (0:2) then _f[ y] = 1 ;/* Y itself is the array index    */
  end ;
  x = 0 ;
  do j = lbound (_f) to hbound (_f) ;
    if _f[j] then continue ;            /* this Y was present: skip       */
    y = _v[j] ;
    output ;                            /* add the missing Y back, X=0    */
  end ;
run ;
                                 
data hash ;
  if _n_ = 1 then do ;
    dcl hash h () ;                /* key-only hash: tracks the Y values */
    h.definekey ("y") ;
    h.definedone () ;
  end ;
  do until (last.id) ;
    set eq ;
    by ID ;
    output ;                       /* output every record as is          */
    h.ref() ;                      /* FIND+ADD in one: insert Y if absent*/
  end ;
  x = 0 ;
  do y = . , 0 to 2 ;              /* check each possible Y              */
    if h.check() ne 0 then output ;/* Y not in table: add it back, X=0   */
  end ;
  h.clear() ;                      /* empty the table per BY group       */
run ;

 

Results (in seconds):

 

1. Key-indexed array: 3.96

2. Hash object: 12.66

 

A pretty stark difference, isn't it? Sure, but it stands to reason: Key-indexing, within its range limitations, is the fastest search algorithm there is because:

 

- There's no hash function overhead.

- A value is simply assigned to the array cell whose index is equal to the key-value.

- There's no need to search before "inserting" the value - it merely overwrites what's there.

- There's no cost of run-time memory allocation, as it is allocated at compile time.

- There's practically no cost cleaning it up after each BY group.
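As a tiny standalone illustration of the point (my own sketch, with illustrative names): the key value itself is the array subscript, so both the insert and the lookup are single operations.

data _null_;
  array slot [-1:2] _temporary_;  /* subscripts span the whole key range  */
  y = 2;
  slot[y] = 1;                    /* "insert": one assignment, no hashing */
  if slot[y] then put "key found: " y=;
run;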

 

As opposed to that, the hash table needs to:

 

- Compute the hash function.

- Search the table to see if the value is already in the table and make a decision based on this finding and the specs.

- Traverse an AVL tree before inserting the value. Though pretty fast, it's not nearly as fast as just sticking it into an array.  

- Allocate memory at run time for each new key-value being added.

- Wipe the contents of the table out after each BY group. Again, it costs much more than setting the array values to missing.

 

Can the hash table be made to perform better? Sure it can. Here the heaviest burden comes from the excessive clean-up: under this particular arrangement the BY groups are very numerous, and the CLEAR method gets called as many times as there are distinct IDs. However, the groups are small and the hash memory footprint isn't an issue, so to improve performance we can add ID to the key portion and clear the table only after every Nth BY group. The value of N needs to be chosen judiciously: if we set it too high, we'll increase the time needed for memory allocation (a fuller table requires more of it), and letting the table grow too big between successive clears means more cleaning effort each time. It's like striking a balance between cleaning a house way too often and way too rarely: if we clean every minute, we will do nothing but clean; and if we do it every few months ... you get the picture. Having tinkered a little before posting, I found the optimal value to be N~50 and also zeroed in on hashexp:9 (instead of the default 8). Thus, the code needs only minor changes (commented below):

data hash ;
  if _n_ = 1 then do ;
    dcl hash h (hashexp:9) ;             /* change: more buckets, 2**9    */
    h.definekey ("ID", "y") ;            /* change: ID added to the key   */
    h.definedone () ;
  end ;
  do until (last.id) ;
    set eq ;
    by ID ;
    output ;
    h.ref() ;
  end ;
  x = 0 ;
  do y = . , 0 to 2 ;
    if h.check() ne 0 then output ;
  end ;
  if not mod (_n_, 50) then h.clear() ;  /* change: clear every 50th group*/
run ;

Simply reducing the clean-up frequency in this manner cuts the hash run time to 7 seconds flat. While still a far cry from the key-indexed array's 3.96 seconds (for the reasons outlined above), it's a huge improvement over 12.66.

 

The advantage of the hash object lies in its much wider applicability. If instead of tracking Y, key-indexed over the range -1 to 2, we had to track, say, Y1-Y10 with an integer value range exceeding ~1E8 (let alone character variables longer than $3), we wouldn't be able to key-index within reasonable memory constraints, while the hash table would still work - we would only have to augment it accordingly.
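A minimal sketch of such an augmentation, with a hypothetical character key far too wide to key-index (the variable name and value are mine, not from the thread):

data _null_;
  length acct $ 40;          /* hypothetical key: too wide to key-index   */
  dcl hash h ();
  h.definekey ("acct");      /* the hash object accepts any key type/range*/
  h.definedone ();
  acct = "A-0001";
  rc = h.ref ();             /* insert if absent, as in the steps above   */
  rc = h.check ();           /* 0 = found                                 */
  put rc=;
run;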

Best

Paul D.

 

 

  

xiangpang
Quartz | Level 8

Thanks everyone. My case is more complex than this sample, but I am pleased to learn the basic idea from you guys. I did not know about the hash object before. Is there any book you would recommend for a beginner?

 

Thanks again. 

 

hashman
Ammonite | Level 13

Hope you don't mind if I recommend a book I coauthored. I can honestly aver that I've read it from cover to cover :).

 

https://www.sas.com/store/books/categories/examples/data-management-solutions-using-sas-hash-table-o...

 

As to the level, it only assumes that the reader is fairly competent in DATA step programming. As far as the hash object is concerned, it starts ab ovo, and the immersion deepens as you progress through it. We've also endeavored to present the subject systematically, from the standpoint of general table operations (such as CRUD), and to avoid aping the SAS documentation like the plague.

 

Paul D.

 

 

xiangpang
Quartz | Level 8

Thanks for your patience and help

