topic Re: how to add a row in SAS Programming

how to add a row

xiangpang — Mon, 30 Jul 2018 01:54:27 GMT

Hello,

I want to add some rows to my dataset 'eq'. For each ID, y should include all of '. 0 1 2'. If one of them is missing in eq, then add it back in 'want', with x set to 0.

Actually, I have many IDs. Could anyone tell me how to do it?

Thanks

data eq;
input ID y x ;
cards;
1 1 27 
1 0 . 
1 . 30 
1 2 38 
2 . 23 
2 0 32  
2 2 . 
3 0 33 
3 1 21 
3 2 13 
4 1 56 
4 0 67 
;
run;

want
1 1 27 
1 0 . 
1 . 30 
1 2 38 
2 . 23 
2 0 32 
2 1 0 
2 2 . 
3 0 33 
3 1 21 
3 . 0 
3 2 13  
4 1 56 
4 0 67   
4 . 0
4 2 0

Re: how to add a row

mjabed600 — Mon, 30 Jul 2018 02:28:59 GMT

I'm having a very difficult time trying to understand what your question really is... Can you elaborate a little more or maybe rephrase the question?

Re: how to add a row

xiangpang — Mon, 30 Jul 2018 02:43:44 GMT

Sorry for that.

The original data has some missing rows. Each ID should have all 4 different values of y (. or 0 or 1 or 2), which means 4 rows for each ID, so I need to add the missing rows back. When a missing row is added back, the value of x will be 0. I am not sure whether I have explained it clearly now.

Re: how to add a row

PGStats — Mon, 30 Jul 2018 03:16:26 GMT

If you don't mind reordering the y's :

data eq;
input ID y x ;
cards;
1 1 27 
1 0 . 
1 . 30 
1 2 38 
2 . 23 
2 0 32  
2 2 . 
3 0 33 
3 1 21 
3 2 13 
4 1 56 
4 0 67 
;


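data want;
/* v{0} to v{2} hold x for y = 0, 1, 2; v{3} holds x for y = . */
array v{0:3} _temporary_;

/* Default every slot to 0 at the start of each ID */
do i = lbound(v) to hbound(v); v{i} = 0; end;

/* Read one whole ID group, storing each x in the slot for its y */
do until(last.id);
    set eq; by id;
    v{coalesce(y, 3)} = x;
    end;

/* Write all four y values for the ID; slots not filled above keep x = 0 */
do y = ., 0, 1, 2;
    x = v{coalesce(y, 3)};
    output;
    end;

drop i;
run;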

proc print data=want noobs; run;

Re: how to add a row

hashman — Mon, 30 Jul 2018 04:23:53 GMT

@xiangpang,

There are many ways to do what you want. I'll show primarily how a hash table can be used to keep track of what you have and what you don't.

First, if your data are actually sorted by ID (as in your sample):

data eq ;                          
  input ID y x ;                   
  cards ;                          
1 1 27                             
1 0  .                             
1 . 30                             
1 2 38                             
2 . 23                             
2 0 32                             
2 2  .                             
3 0 33                             
3 1 21                             
3 2 13                             
4 1 56                             
4 0 67                             
;                                  
run ;                              
                                   
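data want ;
  if _n_ = 1 then do ;
    dcl hash h (multidata:"y") ;     /* allow repeated y values without errors */
    h.definekey ("y") ;
    h.definedone () ;
  end ;
  set eq ;
  by ID ;
  output ;                           /* pass every input record through as is */
  h.add() ;                          /* remember this y for the current ID */
  if last.ID ;                       /* continue only at the end of each ID group */
  x = 0 ;
  do y = . , 0 to 2 ;
    if h.check() ne 0 then output ;  /* y not seen for this ID: add it with x = 0 */
  end ;
  h.clear() ;                        /* empty the table before the next ID */
run ;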

If EQ is not initially ordered:

data _null_ ; 
 dcl hash h (multidata:"y", ordered:"a") ;
 h.definekey ("id", "y") ; 
 h.definedata ("id", "y", "x") ; 
 h.definedone () ; 
 dcl hash a () ; 
 a.definekey ("id") ; 
 a.definedone () ; 
 dcl hiter i ("a") ; 
 do until (z) ; 
   set eq end = z ; 
   h.add() ; 
   a.ref() ; 
 end ; 
 x = 0 ; 
 do while (i.next() = 0) ; 
   do y = . , 0 to 2 ; 
     h.ref() ; 
   end ; 
 end ; 
 h.output (dataset:"want") ; 
run ;

A nice extra of this step is that your data will come out sorted by both ID and Y.

If you're averse to using the SAS hash object, the same as in step #1 can be done using olde goode arrays (minuscule since you only have 4 values to track). The following again assumes that EQ is sorted by ID:

data want (keep = ID x y) ;          
  array _f [-1:2] ;                  
  array _v [-1:2] (. 0 1 2) ;        
  do until (last.id) ;               
    set eq ;                         
    by ID ;                          
    output ;                         
    if nmiss (y)       then _f[-1] = 1 ;   
    else if y in (0:2) then _f[ y] = 1 ;   
  end ;                              
  x = 0 ;                            
  do j = lbound (_f) to hbound (_f) ;
    if _f[j] then continue ;         
    y = _v[j] ;                      
    output ;                         
  end ;                              
run ;

HTH

Paul D.

Re: how to add a row

hashman — Mon, 30 Jul 2018 04:45:37 GMT

@PGStats,

Extremely ingenious. Kudos! And I'm sure you realize that it's constrained by the assumptions that:

(a) EQ has no records with Y not in (. 0 1 2)

(b) EQ has no variables but ID, X, and Y.

But within the limits of the sample data as presented, I can't think of anything more clever or concise.

Best

Paul D.
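
Assumption (a) can be checked before relying on the array step. The following is only a minimal sketch, assuming the sample data set EQ from above; the wording of the log message is purely illustrative. It lists any record whose Y falls outside (. 0 1 2), since such a value would either point outside the bounds of v{0:3} or, in the case of Y=3, land in the slot reserved for the missing value.

```sas
/* Report records whose Y is outside the expected set (. 0 1 2). */
data _null_ ;
  set eq ;
  if not missing (y) and y not in (0, 1, 2) then
    putlog "WARNING: unexpected Y for " ID= y= ;
run ;
```

If nothing is written to the log, the sample satisfies assumption (a).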

Re: how to add a row

ChrisNZ — Mon, 30 Jul 2018 05:35:43 GMT

Another way:

proc sql; 
  create table WANT as
    select  a.ID
          , b.Y
          , coalesce(c.X, 0) as X
    from (select unique ID from EQ)  a
        full outer join
         (select unique Y  from EQ)  b
         on 1
        left join
        EQ                           c
        on  a.ID = c.ID
        and b.Y  = c.Y
    order by 1,2;
quit;

ID	Y	X
1	.	30
1	0	0
1	1	27
1	2	38
2	.	23
2	0	32
2	1	0
2	2	0
3	.	0
3	0	33
3	1	21
3	2	13
4	.	0
4	0	67
4	1	56
4	2	0
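
Note that coalesce(c.X, 0) also sets X to 0 for rows that already exist in EQ with a missing X (such as "1 0 ." and "2 2 ."), whereas the requested 'want' keeps those as missing. If that matters, one possible variation is sketched below. It assumes ID is never missing in EQ, so that a missing c.ID can only mean "no matching row"; the table name WANT2 is made up for the example.

```sas
proc sql ;
  create table WANT2 as
    select  a.ID
          , b.Y
          , case when c.ID is missing then 0   /* no matching row in EQ: added row  */
                 else c.X                      /* existing row: keep X, even if missing */
            end as X
    from (select unique ID from EQ)  a
        inner join
         (select unique Y  from EQ)  b
         on 1
        left join
         EQ                          c
         on  a.ID = c.ID
         and b.Y  = c.Y
    order by 1,2 ;
quit ;
```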

Re: how to add a row

ChrisNZ — Mon, 30 Jul 2018 05:39:26 GMT

A variation:

proc sql; 
    select  a.ID
          , b.Y
          , sum(X, 0) as X
    from (select unique ID from EQ) a
        inner join
         (select unique Y  from EQ) b
         on 1
        left join
         EQ                          c
         on  a.ID = c.ID
         and b.Y  = c.Y
    order by 1,2;
quit;

Re: how to add a row

PGStats — Mon, 30 Jul 2018 17:32:15 GMT

@ChrisNZ, Note, to request a cross product, you can replace (...) inner join (...) on 1 by (...) cross join (...)
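
A sketch of what that substitution could look like, applied to the query above. Wrapping the cross join in an inline view (aliased g here) is just one way to keep the join order explicit, and the names g and WANT_CJ are made up for the example:

```sas
proc sql ;
  create table WANT_CJ as
    select  g.ID
          , g.Y
          , coalesce(c.X, 0) as X
    from (select a.ID, b.Y                      /* every ID paired with every Y */
          from (select unique ID from EQ) a
               cross join
               (select unique Y  from EQ) b) g
        left join
         EQ c
         on  g.ID = c.ID
         and g.Y  = c.Y
    order by 1,2 ;
quit ;
```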

Re: how to add a row

hashman — Mon, 30 Jul 2018 18:39:56 GMT

@PGStats,

Sure; but it's still a Cartesian product. A thought in a different direction would be: Is it possible, in this case, to reformulate the query to make the optimizer avoid it? One could speculate that it could take advantage of the sorted input if prompted with SortedBy=ID, yet it doesn't and, judging from the _method messaging, still sorts behind the scenes. That said, even with the Cartesian product, @ChrisNZ's query performs tolerably well against a data set with a couple of million IDs and 10 numeric variables added as ballast, about 10 secs flat on my laptop. But since DATA step by-processing gets there in 1/4 of the time, I guess the optimizer may not be smart enough to opt for a better optimized path (or maybe there's just no provision for it).

Paul D.
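
For anyone wanting to repeat this kind of check, here is a rough sketch of one way to set it up; it is not necessarily the exact test described above. It assumes EQ really is in ID order, and the names EQ_S and WANT_M are made up. The SORTEDBY= data set option records the sort order on a copy of the data, and the _METHOD option of PROC SQL writes the chosen join and sort methods to the log.

```sas
/* Copy EQ and record that it is already ordered by ID. */
data eq_s (sortedby = ID) ;
  set eq ;
run ;

/* _method prints the methods the optimizer picked to the log. */
proc sql _method ;
  create table WANT_M as
    select  a.ID
          , b.Y
          , coalesce(c.X, 0) as X
    from (select unique ID from EQ_S) a
        inner join
         (select unique Y  from EQ_S) b
         on 1
        left join
         EQ_S                         c
         on  a.ID = c.ID
         and b.Y  = c.Y
    order by 1,2 ;
quit ;
```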

Re: how to add a row

PGStats — Mon, 30 Jul 2018 19:05:00 GMT

My note was only about SAS/SQL syntax. I wouldn't advocate using SQL for such a problem, unless perhaps if the data resides in a distant DBMS, in the hope that SQL could get the server to perform the query.

I would be interested in a timing comparison between your proposed hash- and array-based solutions.

Re: how to add a row

ChrisNZ — Mon, 30 Jul 2018 21:27:45 GMT

@PGStats Thanks. I never used cross join before. One more in the toolbox... 🙂

Re: how to add a row

ChrisNZ — Mon, 30 Jul 2018 21:33:53 GMT

It's a small Cartesian product (one column in each table, unique values only) and it's very legible.

Unless performance is actually an issue, I'd go for legible (and I care a lot about performance!).

A data step performs a single sequential read, so will be faster, but the guy who comes after you will curse you for spending so much time understanding the logic rather than reading a very simple join.

Re: how to add a row

hashman — Mon, 30 Jul 2018 22:26:04 GMT

@PGStats,

Sure, eager to oblige:

data eq (keep = ID X Y) ;               
  retain r "integer" ;                  
  call streaminit (7) ;                 
  array yy [4] _temporary_ (. 0 1 2) ;  
  do ID = 1 to 5E6 ;                    
    do _n_ = 1 to rand (r, dim (yy)) ;  
      Y = yy (rand (r, dim (yy))) ;     
      X = rand (r, 99) ;                
      output ;                          
    end ;                               
  end ;                                 
run ;                                   
                                        
data kxarr (keep = ID x y) ;            
  array _f [-1:2] ;                     
  array _v [-1:2] (. 0 1 2) ;           
  do until (last.id) ;                  
    set eq ;                            
    by ID ;                             
    output ;                            
    if nmiss (y)       then _f[-1] = 1 ;
    else if y in (0:2) then _f[ y] = 1 ;
  end ;                                 
  x = 0 ;                               
  do j = lbound (_f) to hbound (_f) ;   
    if _f[j] then continue ;            
    y = _v[j] ;                         
    output ;                            
  end ;                                 
run ;  
                                 
data hash ;                       
  if _n_ = 1 then do ;             
    dcl hash h () ;                
    h.definekey ("y") ;            
    h.definedone () ;              
  end ;                            
  do until (last.id) ;             
    set eq ;                       
    by ID ;                        
    output ;                       
    h.ref() ;                      
  end ;                            
  x = 0 ;                          
  do y = . , 0 to 2 ;              
    if h.check() ne 0 then output ;
  end ;                            
  h.clear() ;                      
run ;

Results (in seconds):

1. Key-indexed array: 3.96

2. Hash object: 12.66

A pretty stark difference, isn't it? Sure, but it stands to reason: Key-indexing, within its range limitations, is the fastest search algorithm there is because:

- There's no hash function overhead.

- A value is simply assigned to the array cell whose index is equal to the key-value.

- There's no need to search before "inserting" the value - it merely overwrites what's there.

- There's no cost of run-time memory allocation, as it is allocated at compile time.

- There's practically no cost cleaning it up after each BY group.

As opposed to that, the hash table needs to:

- Compute the hash function.

- Search the table to see if the value is already in the table and make a decision based on this finding and the specs.

- Traverse an AVL tree before inserting the value. Though pretty fast, it's not nearly as fast as just sticking it into an array.

- Allocate memory at run time for each new key-value being added.

- Wipe the contents of the table out after each BY group. Again, it costs much more than setting the array values to missing.
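
To make the key-indexing side of this comparison concrete, here is a stripped-down sketch separate from the benchmark code. It assumes integer keys in the range 0 to 9; the array name SEEN and the key values are made up for the example.

```sas
data _null_ ;
  array seen [0:9] _temporary_ ;     /* one cell per possible key value */
  do key = 3, 7, 3 ;
    seen[key] = 1 ;                  /* "insert": overwrite the cell, no search needed */
  end ;
  do key = 0 to 9 ;
    if seen[key] then put key= "was seen" ;   /* "search": read the cell directly */
  end ;
  call missing (of seen[*]) ;        /* clean-up is a single pass over the cells */
run ;
```

The key value itself is the array index, which is why nothing has to be hashed, searched, or allocated at run time.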

Can the hash table be made to perform better? Sure it can. Here the heaviest burden comes from the excessive clean-up since under this particular arrangement the BY-groups are very numerous and the CLEAR method gets called as many times as there are distinct IDs. However, the groups are small and the hash memory footprint isn't an issue, so to improve performance, we can add ID to the key portion and clean the table after each Nth BY group. The value of N needs to be chosen optimally: If we set it too high, we'll increase the time needed for memory allocation, as a fuller table requires more time for that, plus letting the table get too big between the successive Ns means more cleaning effort. It's sort of like striking a balance between cleaning a house way too often and way too rarely: If we clean every minute, we will do nothing but clean; and if we do it every few months ... you get the picture. Having tinkered a little before posting, I found the optimal value to be N~50 and also zeroed in on hashexp:9 (instead of the default 8). Thus, the code needs only minor changes (the HASHEXP argument, ID added to the key, and the conditional CLEAR):

data hash ;                            
  if _n_ = 1 then do ;                 
    dcl hash h (hashexp:9) ;                    
    h.definekey ("ID", "y") ;          
    h.definedone () ;                  
  end ;                                
  do until (last.id) ;                 
    set eq ;                           
    by ID ;                            
    output ;                           
    h.ref() ;                          
  end ;                                
  x = 0 ;                              
  do y = . , 0 to 2 ;                  
    if h.check() ne 0 then output ;    
  end ;                                
  if not mod (_n_, 50) then h.clear() ;
run ;

Simply reducing the clean-up frequency in this manner reduces the hash run time to 7 seconds flat. While still a far cry from the key-index's 3.96 seconds (for the reasons we can't control), it's a huge improvement against 12.66.

The advantage of the hash object lies in its much wider applicability. If instead of tracking Y ranging from -1 to 2 we had to track, say, Y1-Y10 with an integer value range exceeding ~1E8 (let alone if they were character variables longer than $3), we wouldn't be able to key-index within reasonable memory constraints, while the hash table would still work - we would only have to augment it accordingly.
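
As a small illustration of that wider applicability, the sketch below tracks distinct values of a character key, which a key-indexed array cannot do directly. The variable names and the key values are invented for the example.

```sas
data _null_ ;
  length code $ 20 ;
  dcl hash seen () ;
  seen.definekey ("code") ;
  seen.definedone () ;
  do code = "alpha", "beta", "alpha" ;
    rc = seen.ref() ;                /* insert the key only if it is not there yet */
  end ;
  n_distinct = seen.num_items ;      /* number of distinct keys now in the table */
  put n_distinct= ;
run ;
```

The same pattern extends to composite keys by listing more variables in DEFINEKEY.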

Best

Paul D.

Re: how to add a row

hashman — Mon, 30 Jul 2018 23:51:29 GMT

In general, I agree with you in terms of legibility vs insignificant differences in performance (though I care a lot about performance, too). In this case, however, it's very simple DATA step code with rather straightforward standard BY processing logic:

For each BY group:

- output every record as is.

- store Y in a table (hash or otherwise)

After each BY group:

- Output a record with X=0 for each value of (. 0 1 2) not found in the table

- Clean the table

This kind of logic is the bread and butter of any SAS programmer. If one understands the task, this logic is obvious, and, conversely, the nature of the task can be easily understood from this kind of code. Besides, "the guy who comes after" me would read my comments explaining both before starting to read the code. As to whether SQL is more legible than an equivalent DATA step, it depends on the nature of the task and the background of whoever is trying to comprehend it. With mine, if I were to reconstruct this particular task from code alone, I'd do it far more readily from the DATA step in question than from your query; for other people (such as those who had learned SQL before procedural programming), it may be just the opposite. Suum cuique.

Best

Paul D.

Re: how to add a row

ChrisNZ — Mon, 30 Jul 2018 23:55:33 GMT

Suum cuique indeed 🙂

Re: how to add a row

xiangpang — Tue, 31 Jul 2018 00:18:02 GMT

Thanks everyone. My case is more complex than the sample. But I am pleased to learn the basic idea from you guys. I did not know hash before. Is there any book recommended for a beginner?

Thanks again.

Re: how to add a row

PGStats — Tue, 31 Jul 2018 02:40:58 GMT

@hashman, Wow, that's a very nice demonstration, and some very instructive explanations. Please continue to impress us with hash based solutions. Speaking for myself, this is one area where I don't feel so confident. Thank you.

Re: how to add a row

hashman — Tue, 31 Jul 2018 02:41:52 GMT

Hope you don't mind if I recommend the book coauthored by myself. I can honestly aver that I've read it from cover to cover :).

https://www.sas.com/store/books/categories/examples/data-management-solutions-using-sas-hash-table-operations-a-business-intelligence-case-study/prodBK_69153_en.html

As to the level, it only assumes that the reader is fairly competent in DATA step programming. As far as the hash object is concerned, it starts ab ovo, and immersion grows as you progress through. We've also endeavored to present the subject in a systematic way from the standpoint of general table operations (such as CRUD) and avoid aping SAS documentation like the plague.

Paul D.

Re: how to add a row

xiangpang — Tue, 31 Jul 2018 03:02:20 GMT

Thanks for your patience and help