I have a data set that looks like this, with data for different years in different columns
county | age | data2015 | data2016 | data2017 |
1 | 1 | 89 | 17 | 79 |
1 | 2 | 6 | 76 | 49 |
1 | 3 | 87 | 5 | 25 |
1 | 4 | 45 | 17 | 1 |
1 | 5 | 78 | 59 | 45 |
2 | 1 | 35 | 4 | 20 |
2 | 2 | 51 | 95 | 46 |
2 | 3 | 76 | 24 | 8 |
2 | 4 | 60 | 66 | 30 |
2 | 5 | 54 | 28 | 41 |
I would like to change this so each year has it's own set of rows, but keeping their county and age, so it would look like this. Is there an easy way to do this? I could read in the data set three times, once for each year, and append each time. But I just wanted to ask if there was an easier way.
county | age | year | data |
1 | 1 | 2015 | 89 |
1 | 2 | 2015 | 6 |
1 | 3 | 2015 | 87 |
1 | 4 | 2015 | 45 |
1 | 5 | 2015 | 78 |
2 | 1 | 2015 | 35 |
2 | 2 | 2015 | 51 |
2 | 3 | 2015 | 76 |
2 | 4 | 2015 | 60 |
2 | 5 | 2015 | 54 |
1 | 1 | 2016 | 17 |
1 | 2 | 2016 | 76 |
1 | 3 | 2016 | 5 |
1 | 4 | 2016 | 17 |
1 | 5 | 2016 | 59 |
2 | 1 | 2016 | 4 |
2 | 2 | 2016 | 95 |
2 | 3 | 2016 | 24 |
2 | 4 | 2016 | 66 |
2 | 5 | 2016 | 28 |
1 | 1 | 2017 | 79 |
1 | 2 | 2017 | 49 |
1 | 3 | 2017 | 25 |
1 | 4 | 2017 | 1 |
1 | 5 | 2017 | 45 |
2 | 1 | 2017 | 20 |
2 | 2 | 2017 | 46 |
2 | 3 | 2017 | 8 |
2 | 4 | 2017 | 30 |
2 | 5 | 2017 | 41 |
Start with a transpose:
proc transpose data=have out=trans;
by county age;
var data:;
run;
Then transform the result
data want;
set trans;
year = input(substr(_name_,4),4.);
rename col1=data;
drop _name_;
run;
and then you only need to sort according to your needs.
Start with a transpose:
proc transpose data=have out=trans;
by county age;
var data:;
run;
Then transform the result
data want;
set trans;
year = input(substr(_name_,4),4.);
rename col1=data;
drop _name_;
run;
and then you only need to sort according to your needs.
Kurt, just to let you know, your transpose then transform works. Thanks!
Hello @geneshackman
If you are comfortable using Hashes, It's rather straight forward
data have;
input county age data2015 data2016 data2017;
cards;
1 1 89 17 79
1 2 6 76 49
1 3 87 5 25
1 4 45 17 1
1 5 78 59 45
2 1 35 4 20
2 2 51 95 46
2 3 76 24 8
2 4 60 66 30
2 5 54 28 41
;
data _null_;
if _n_=1 then do;
dcl hash H (ordered: "A") ;
h.definekey ("year","county","age") ;
h.definedata ("county","age","year", "data") ;
h.definedone () ;
end;
set have end=lr;
array d data2015-data2017;
do i=1 to dim(d);
data=d(i);
year=compress(vname(d(i)),,'kd');
rc=h.add();
end;
if lr then h.output(dataset:'want');
run;
NOTE: The solution assumes memory is not a constraint
Your code is memory-bound only because you make no assumption that the input file is sorted - which is an advantage if memory is plentiful. Note that the TRANSPOSE solution by @Kurt_Bremser does rely on the sorted order. If your hash code took advantage of it as well, the hash table memory footprint would be dictated only by the size of the largest group of records by COUNTY. For example:
data have ;
input county age data2015 data2016 data2017 ;
cards ;
1 1 89 17 79
1 2 6 76 49
1 3 87 5 25
1 4 45 17 1
1 5 78 59 45
2 1 35 4 20
2 2 51 95 46
2 3 76 24 8
2 4 60 66 30
2 5 54 28 41
;
run ;
data want (keep = county age year data) ;
if _n_ = 1 then do ;
dcl hash h (multidata:"Y", ordered:"A") ;
h.definekey ("year") ;
h.definedata ("year", "age", "data") ;
h.definedone () ;
dcl hiter hi ("h") ;
end ;
do until (last.county) ;
set have ;
by county ;
array dd data: ;
do over dd ;
year = input (compress (vname (dd),,"kd"), 4.) ;
data = dd ;
h.add() ;
end ;
end ;
do while (hi.next() = 0) ;
output ;
end ;
h.clear() ;
run ;
It's instructive to see how the task could be approached in the same vein before the advent of the hash object, i.e. using arrays. Obviously, in this case, the array serving in lieu of the hash table should be either sized using a preliminary pass through the data or just made "large enough" (which is what's done below):
data want (keep = county age year data) ;
do j = 1 by 1 until (last.county) ;
set have ;
by county ;
array dd (i) data: ;
array tt [0:999, 9999] _temporary_ ;
do over dd ;
tt [0,j] = age ;
tt [i,j] = dd ;
end ;
end ;
do over dd ;
year = input (compress (vname (dd),,"kd"), 4.) ;
do k = 1 to j ;
age = tt [0,k] ;
data = tt [i,k] ;
output ;
end ;
end ;
run ;
With a more dynamic approach, i.e. with pre-sizing the array bounds using a preliminary pass through HAVE, it could look thusly:
/* size up array TT */
data _null_ ;
do j = 1 by 1 until (last.county) ;
set have end = z ;
by county ;
end ;
retain jmax ;
jmax = jmax max j ;
array dd data: ;
if z ;
call symputx ("i", dim (dd)) ;
call symputx ("j", jmax ) ;
run ;
/* execute */
data want (keep = county age year data) ;
do j = 1 by 1 until (last.county) ;
set have ;
by county ;
array dd (i) data: ;
array tt [0:&i, &j] _temporary_ ;
do over dd ;
tt [0,j] = age ;
tt [i,j] = dd ;
end ;
end ;
do over dd ;
year = input (compress (vname (dd),,"kd"), 4.) ;
do k = 1 to j ;
age = tt [0,k] ;
data = tt [i,k] ;
output ;
end ;
end ;
run ;
Kind regards
Paul D.
I much prefer the offers by @Kurt_Bremser and @novinosrin because they are dynamic, i.e. not hard coded. To better appreciate their beauty, you can compare them with a sledge-hammer hard-coded frontal attack:
data have ;
input county age data2015 data2016 data2017 ;
cards ;
1 1 89 17 79
1 2 6 76 49
1 3 87 5 25
1 4 45 17 1
1 5 78 59 45
2 1 35 4 20
2 2 51 95 46
2 3 76 24 8
2 4 60 66 30
2 5 54 28 41
;
run ;
data want (keep = county age year data) ;
set have (keep = county age data2015 in = d15)
have (keep = county age data2016 in = d16)
have (keep = county age data2017)
;
if d15 then do ;
year = 2015 ;
data = data2015 ;
end ;
else if d16 then do ;
year = 2016 ;
data = data2016 ;
end ;
else do ;
year = 2017 ;
data = data2017 ;
end ;
run ;
As you can see, if you added another piece of data (related to another year), for example, data2018, you'd need to change the code to accommodate it. Also, it requires as many passes through the data as you have data years. The dynamic approaches are devoid of these drawbacks.
Kind regards
Paul D.
Hello @geneshackman , I may take this opportunity to recommend some books that are priceless to get users to speed in their learning.
1. Book: Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study
The best ever. This link has some preview comments -https://communities.sas.com/t5/Community-Matters/Book-Data-Management-Solutions-Using-SAS-Hash-Table...
2. For Other approaches, I prefer Ron Cody(quite wholesome)
HTH
A sincere request to SAS press @ChrisHemedinger and @ShelleySessoms . I would like to bring to your attention almost everyone whom I met at Citizens seem to be totally unaware of the variety of books(individual topics level by granulairty) offered by SAS press and its details. As a matter of fact, the kind of learning I had at DePaul and the advancement does not parallel but i get paid so I can't complain. lol Consequently, they seem to shy away from application of advance techniques
While my fellow colleagues have observed my excitement to get them to notice the wealth of knowledge they can get from the books, I would like to ask whether SAS press makes their customers aware of what is available coz it seems the contrary at least here at my CT office. Thank you for your attention.
Hi @novinosrin,
I will pass your message along to the SAS Press folks, and provide your email address should they want to discuss ideas with you for your colleagues at Citizens.
Thanks for being such a supporter of SAS Press books.
Best,
Shelley
An even more general solution that requires less code is the %untranspose macro that you can download from: https://github.com/gerhard1050/Untranspose-a-Wide-File
Using that macro the only code you'd need would be:
%untranspose(data=have, out=want,by=county age, var=data,id=year)
Art, CEO, AnalystFinder.com
Perhaps read the data differently to begin with:
data have ; input county age @ ; do year = 2015 to 2017; input data @; output; end; input; cards ; 1 1 89 17 79 1 2 6 76 49 1 3 87 5 25 1 4 45 17 1 1 5 78 59 45 2 1 35 4 20 2 2 51 95 46 2 3 76 24 8 2 4 60 66 30 2 5 54 28 41 ; run ;
Proc sort to get the order you want if this doesn't work.
Hi Kurt,
Thanks for the suggestion. Question about the first part
proc transpose data=have out=trans;
by county age;
var data:;
run;
Where does the var "data" come from? I guess it's the first part of the data2015, data2016, etc? But I get "variable data not found". Does it have to be certain number of letters long, or do the "data2015", "data2016" all have to have exactly the same var name formats?
Thanks
If your error messages says that variable DATA was not found then most likely you did not include the colon in the VAR statement.
var data: ;
"If your error messages says that variable DATA was not found then most likely you did not include the colon in the VAR statement."
Correct! Oops, early morning, I wasn't looking closely and thought it was a double ;; that I accidentally added. I added the colon and it worked. Thanks.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.