Solved: changing dataset columns to rows

geneshackman · Posted 07-11-2019 02:56 PM

I have a data set that looks like this, with data for different years in different columns

county	age	data2015	data2016	data2017
1	1	89	17	79
1	2	6	76	49
1	3	87	5	25
1	4	45	17	1
1	5	78	59	45
2	1	35	4	20
2	2	51	95	46
2	3	76	24	8
2	4	60	66	30
2	5	54	28	41

I would like to change this so each year has it's own set of rows, but keeping their county and age, so it would look like this. Is there an easy way to do this? I could read in the data set three times, once for each year, and append each time. But I just wanted to ask if there was an easier way.

county	age	year	data
1	1	2015	89
1	2	2015	6
1	3	2015	87
1	4	2015	45
1	5	2015	78
2	1	2015	35
2	2	2015	51
2	3	2015	76
2	4	2015	60
2	5	2015	54
1	1	2016	17
1	2	2016	76
1	3	2016	5
1	4	2016	17
1	5	2016	59
2	1	2016	4
2	2	2016	95
2	3	2016	24
2	4	2016	66
2	5	2016	28
1	1	2017	79
1	2	2017	49
1	3	2017	25
1	4	2017	1
1	5	2017	45
2	1	2017	20
2	2	2017	46
2	3	2017	8
2	4	2017	30
2	5	2017	41

Kurt_Bremser · Posted 07-11-2019 03:03 PM

Start with a transpose:

proc transpose data=have out=trans;
by county age;
var data:;
run;

Then transform the result

data want;
set trans;
year = input(substr(_name_,4),4.);
rename col1=data;
drop _name_;
run;

and then you only need to sort according to your needs.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 07-11-2019 03:03 PM

Start with a transpose:

proc transpose data=have out=trans;
by county age;
var data:;
run;

Then transform the result

data want;
set trans;
year = input(substr(_name_,4),4.);
rename col1=data;
drop _name_;
run;

and then you only need to sort according to your needs.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

geneshackman · Posted 07-12-2019 09:55 AM

Kurt, just to let you know, your transpose then transform works. Thanks!

novinosrin · Posted 07-11-2019 03:19 PM

Hello @geneshackman

If you are comfortable using Hashes, It's rather straight forward


data have;
input county	age	data2015	data2016	data2017;
cards;
1	1	89	17	79
1	2	6	76	49
1	3	87	5	25
1	4	45	17	1
1	5	78	59	45
2	1	35	4	20
2	2	51	95	46
2	3	76	24	8
2	4	60	66	30
2	5	54	28	41
;

data _null_;
if _n_=1 then do;
   dcl hash H (ordered: "A") ;
   h.definekey  ("year","county","age") ;
   h.definedata ("county","age","year", "data") ;
   h.definedone () ;
end;
set have end=lr;
array d data2015-data2017;
do i=1 to dim(d);
data=d(i);
year=compress(vname(d(i)),,'kd');
rc=h.add();
end;
if lr then h.output(dataset:'want');
run;

NOTE: The solution assumes memory is not a constraint

hashman · Posted 07-11-2019 09:20 PM

@novinosrin :

Your code is memory-bound only because you make no assumption that the input file is sorted - which is an advantage if memory is plentiful. Note that the TRANSPOSE solution by @Kurt_Bremser does rely on the sorted order. If your hash code took advantage of it as well, the hash table memory footprint would be dictated only by the size of the largest group of records by COUNTY. For example:

data have ;                                    
  input county age data2015 data2016 data2017 ;
  cards ;                                      
1      1      89      17      79               
1      2      6       76      49               
1      3      87       5      25               
1      4      45      17       1               
1      5      78      59      45               
2      1      35       4      20               
2      2      51      95      46               
2      3      76      24       8               
2      4      60      66      30               
2      5      54      28      41               
;                                              
run ;                                          

data want (keep = county age year data) ;             
  if _n_ = 1 then do ;                                
    dcl hash h (multidata:"Y", ordered:"A") ;         
    h.definekey  ("year") ;                           
    h.definedata ("year", "age", "data") ;            
    h.definedone () ;                                 
    dcl hiter hi ("h") ;                              
  end ;                                               
  do until (last.county) ;                            
    set have ;                                        
    by county ;                                       
    array dd data: ;                                  
    do over dd ;                                      
      year = input (compress (vname (dd),,"kd"), 4.) ;
      data = dd ;                                     
      h.add() ;                                       
    end ;                                             
  end ;                                               
  do while (hi.next() = 0) ;                          
    output ;                                          
  end ;                                               
  h.clear() ;                                         
run ;

It's instructive to see how the task could be approached in the same vein before the advent of the hash object, i.e. using arrays. Obviously, in this case, the array serving in lieu of the hash table should be either sized using a preliminary pass through the data or just made "large enough" (which is what's done below):

data want (keep = county age year data) ;           
  do j = 1 by 1 until (last.county) ;               
    set have ;                                      
    by county ;                                     
    array dd (i) data: ;                            
    array tt [0:999, 9999] _temporary_ ;            
    do over dd ;                                    
      tt [0,j] = age ;                              
      tt [i,j] = dd ;                               
    end ;                                           
  end ;                                             
  do over dd ;                                      
    year = input (compress (vname (dd),,"kd"), 4.) ;
    do k = 1 to j ;                                 
      age  = tt [0,k] ;                             
      data = tt [i,k] ;                             
      output ;                                      
    end ;                                           
  end ;                                             
run ;

With a more dynamic approach, i.e. with pre-sizing the array bounds using a preliminary pass through HAVE, it could look thusly:

/* size up array TT */                              
data _null_ ;                                       
  do j = 1 by 1 until (last.county) ;               
    set have end = z ;                              
    by county ;                                     
  end ;                                             
  retain jmax ;                                     
  jmax = jmax max j ;                               
  array dd data: ;                                  
  if z ;                                            
  call symputx ("i", dim (dd)) ;                    
  call symputx ("j", jmax    ) ;                    
run ;                                               
/* execute */                                       
data want (keep = county age year data) ;           
  do j = 1 by 1 until (last.county) ;               
    set have ;                                      
    by county ;                                     
    array dd (i) data: ;                            
    array tt [0:&i, &j] _temporary_ ;               
    do over dd ;                                    
      tt [0,j] = age ;                              
      tt [i,j] = dd ;                               
    end ;                                           
  end ;                                             
  do over dd ;                                      
    year = input (compress (vname (dd),,"kd"), 4.) ;
    do k = 1 to j ;                                 
      age  = tt [0,k] ;                             
      data = tt [i,k] ;                             
      output ;                                      
    end ;                                           
  end ;                                             
run ;

Kind regards

Paul D.

hashman · Posted 07-11-2019 03:42 PM

@geneshackman :

I much prefer the offers by @Kurt_Bremser and @novinosrin because they are dynamic, i.e. not hard coded. To better appreciate their beauty, you can compare them with a sledge-hammer hard-coded frontal attack:

data have ;                                     
  input county age data2015 data2016 data2017 ; 
  cards ;                                       
1      1      89      17      79                
1      2      6       76      49                
1      3      87       5      25                
1      4      45      17       1                
1      5      78      59      45                
2      1      35       4      20                
2      2      51      95      46                
2      3      76      24       8                
2      4      60      66      30                
2      5      54      28      41                
;                                               
run ;                                           
                                                
data want (keep = county age year data) ;       
  set have (keep = county age data2015 in = d15)
      have (keep = county age data2016 in = d16)
      have (keep = county age data2017)         
  ;                                             
  if d15 then do ;                              
    year =     2015 ;                           
    data = data2015 ;                           
  end ;                                         
  else if d16 then do ;                         
    year =     2016 ;                           
    data = data2016 ;                           
  end ;                                         
  else do ;                                     
    year =     2017 ;                           
    data = data2017 ;                           
  end ;                                         
run ;

As you can see, if you added another piece of data (related to another year), for example, data2018, you'd need to change the code to accommodate it. Also, it requires as many passes through the data as you have data years. The dynamic approaches are devoid of these drawbacks.

Kind regards

Paul D.

novinosrin · Posted 07-11-2019 04:00 PM

Hello @geneshackman , I may take this opportunity to recommend some books that are priceless to get users to speed in their learning.

1. Book: Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study

The best ever. This link has some preview comments -https://communities.sas.com/t5/Community-Matters/Book-Data-Management-Solutions-Using-SAS-Hash-Table...

2. For Other approaches, I prefer Ron Cody(quite wholesome)

HTH

A sincere request to SAS press @ChrisHemedinger and @ShelleySessoms . I would like to bring to your attention almost everyone whom I met at Citizens seem to be totally unaware of the variety of books(individual topics level by granulairty) offered by SAS press and its details. As a matter of fact, the kind of learning I had at DePaul and the advancement does not parallel but i get paid so I can't complain. lol Consequently, they seem to shy away from application of advance techniques

While my fellow colleagues have observed my excitement to get them to notice the wealth of knowledge they can get from the books, I would like to ask whether SAS press makes their customers aware of what is available coz it seems the contrary at least here at my CT office. Thank you for your attention.

ShelleySessoms · Posted 07-12-2019 08:34 AM

Hi @novinosrin,

I will pass your message along to the SAS Press folks, and provide your email address should they want to discuss ideas with you for your colleagues at Citizens.

Thanks for being such a supporter of SAS Press books.

Best,

Shelley

art297 · Posted 07-11-2019 04:11 PM

An even more general solution that requires less code is the %untranspose macro that you can download from: https://github.com/gerhard1050/Untranspose-a-Wide-File

Using that macro the only code you'd need would be:

%untranspose(data=have, out=want,by=county age, var=data,id=year)

Art, CEO, AnalystFinder.com

ballardw · Posted 07-11-2019 05:20 PM

Perhaps read the data differently to begin with:

data have ;                                     
  input county age @ ; 
  do year = 2015 to 2017;
    input data @;
    output;
  end;
  input;
  cards ;                                       
1      1      89      17      79                
1      2      6       76      49                
1      3      87       5      25                
1      4      45      17       1                
1      5      78      59      45                
2      1      35       4      20                
2      2      51      95      46                
2      3      76      24       8                
2      4      60      66      30                
2      5      54      28      41                
;                                               
run ;

Proc sort to get the order you want if this doesn't work.

geneshackman · Posted 07-12-2019 09:37 AM

Hi Kurt,

Thanks for the suggestion. Question about the first part

proc transpose data=have out=trans;
by county age;
var data:;
run;

Where does the var "data" come from? I guess it's the first part of the data2015, data2016, etc? But I get "variable data not found". Does it have to be certain number of letters long, or do the "data2015", "data2016" all have to have exactly the same var name formats?

Thanks

Tom · Posted 07-12-2019 09:41 AM

If your error messages says that variable DATA was not found then most likely you did not include the colon in the VAR statement.

var data: ;

geneshackman · Posted 07-12-2019 09:45 AM

"If your error messages says that variable DATA was not found then most likely you did not include the colon in the VAR statement."

Correct! Oops, early morning, I wasn't looking closely and thought it was a double ;; that I accidentally added. I added the colon and it worked. Thanks.

changing dataset columns to rows

Re: changing dataset columns to rows

Re: changing dataset columns to rows

Re: changing dataset columns to rows

Re: changing dataset columns to rows

Re: changing dataset columns to rows

Re: changing dataset columns to rows

Re: changing dataset columns to rows

Re: changing dataset columns to rows

Re: changing dataset columns to rows

Re: changing dataset columns to rows

Re: changing dataset columns to rows

Re: changing dataset columns to rows

Re: changing dataset columns to rows

Registration is open

SAS Training: Just a Click Away