Hello,
I would like to link and aggregate records (over 10 million) in a dataset to another dataset. Is there a way to achieve this in one task? Thank you
Data 1 - what I have
City | City_Code | Item 1 | Item 2 | Item 3 | Item 4 |
City A | 123001001 | 10 | 20 | 30 | 40 |
City B | 123001002 | 20 | 30 | 40 | 50 |
City C | 123002001 | 30 | 40 | 50 | 60 |
City D | 123002002 | 40 | 50 | 60 | 70 |
City E | 123003001 | 50 | 60 | 70 | 80 |
City F | 123003002 | 60 | 70 | 80 | 90 |
City G | 123004001 | 70 | 80 | 90 | 100 |
City H | 123004002 | 80 | 90 | 100 | 110 |
Data 2 - what I want (the region code forms the beginning part of the city code)
Region | Region_Code | Item 1 | Item 2 | Item 3 | Item 4 |
North | 123001 | 30 | 50 | 70 | 90 |
South | 123002 | 70 | 90 | 110 | 130 |
East | 123003 | 110 | 130 | 150 | 170 |
West | 123004 | 150 | 170 | 190 | 210 |
Thank you Reeza.
Please can you clarify 'Now apply the Region name to the data'
I can't seem to format the city code, even though the format contains fewer characters, when I run the program it retains the region code. The codes are numeric
Thanks Reeza,
I also had to transpose the output into the format I preferred.
SQL is perhaps the best (or at least simplest) option:
data city ;
input @1 City $6. City_Code:$9. Item1-item4 ;
cards ;
City A 123001001 10 20 30 40
City B 123001002 20 30 40 50
City C 123002001 30 40 50 60
City D 123002002 40 50 60 70
City E 123003001 50 60 70 80
City F 123003002 60 70 80 90
City G 123004001 70 80 90 100
City H 123004002 80 90 100 110
;
data region ;
input Region $ Region_Code:$6. ;
cards ;
North 123001
South 123002
East 123003
West 123004
;
run ;
proc sql ;
create table want as
select region
, region_code
, sum (item1) as item1
, sum (item2) as item2
, sum (item3) as item3
, sum (item4) as item4
from city, region
where region_code = put (city_code, $6.)
group 1, 2
order 2
;
quit ;
If you loathe to list all the items as above, especially if there are way many than 4, heed what @Reeza has said. It can be done in a single DATA step, too (i.e. without listing all the items), using the hash object; but if you see the requisite code, you'll sure appreciate the simplicity of SQL:
data _null_ ;
if _n_ = 1 then do ;
if 0 then set region city ;
array it item: ;
dcl hash h (ordered:"a") ;
h.definekey ("region_code") ;
h.definedata ("region", "region_code") ;
do over it ;
h.definedata (vname (it)) ;
end ;
h.definedone() ;
do until (lr) ;
set region end = lr ;
h.add() ;
end ;
end ;
set city (rename=(item1-item4=_it1-_it4)) end = lc ;
array _it _it: ;
region_code = put (city_code, $6.) ;
if h.find() ne 0 then call missing (of item:) ;
do over it ;
it + _it ;
end ;
h.replace() ;
if lc then h.output (dataset:"want") ;
run ;
OTOH, instead of listing all the sum(item1) as item1, ... in the SQL query, you can auto-construct a macro variable containing the necessary text:
data _null_ ;
length sumit $ 32767 ;
do i = 1 to 4 ;
sumit = catx (",", sumit, cats ("sum(item", i, ") as item", i)) ;
end ;
call symputx ("sumit", sumit) ;
run ;
proc sql ;
create table want as
select region
, region_code
, &sumit
from city, region
where region_code = put (city_code, $6.)
group 1, 2
order 2
;
quit ;
Kind regards
Paul D.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.