Hello,
I would like to link and aggregate records (over 10 million) in a dataset to another dataset. Is there a way to achieve this in one task? Thank you
Data 1 - what I have
City | City_Code | Item 1 | Item 2 | Item 3 | Item 4 |
City A | 123001001 | 10 | 20 | 30 | 40 |
City B | 123001002 | 20 | 30 | 40 | 50 |
City C | 123002001 | 30 | 40 | 50 | 60 |
City D | 123002002 | 40 | 50 | 60 | 70 |
City E | 123003001 | 50 | 60 | 70 | 80 |
City F | 123003002 | 60 | 70 | 80 | 90 |
City G | 123004001 | 70 | 80 | 90 | 100 |
City H | 123004002 | 80 | 90 | 100 | 110 |
Data 2 - what I want (the region code forms the beginning part of the city code)
Region | Region_Code | Item 1 | Item 2 | Item 3 | Item 4 |
North | 123001 | 30 | 50 | 70 | 90 |
South | 123002 | 70 | 90 | 110 | 130 |
East | 123003 | 110 | 130 | 150 | 170 |
West | 123004 | 150 | 170 | 190 | 210 |
Thank you Reeza.
Please can you clarify 'Now apply the Region name to the data'
I can't seem to format the city code, even though the format contains fewer characters, when I run the program it retains the region code. The codes are numeric
Thanks Reeza,
I also had to transpose the output into the format I preferred.
SQL is perhaps the best (or at least simplest) option:
data city ;
input @1 City $6. City_Code:$9. Item1-item4 ;
cards ;
City A 123001001 10 20 30 40
City B 123001002 20 30 40 50
City C 123002001 30 40 50 60
City D 123002002 40 50 60 70
City E 123003001 50 60 70 80
City F 123003002 60 70 80 90
City G 123004001 70 80 90 100
City H 123004002 80 90 100 110
;
data region ;
input Region $ Region_Code:$6. ;
cards ;
North 123001
South 123002
East 123003
West 123004
;
run ;
proc sql ;
create table want as
select region
, region_code
, sum (item1) as item1
, sum (item2) as item2
, sum (item3) as item3
, sum (item4) as item4
from city, region
where region_code = put (city_code, $6.)
group 1, 2
order 2
;
quit ;
If you loathe to list all the items as above, especially if there are way many than 4, heed what @Reeza has said. It can be done in a single DATA step, too (i.e. without listing all the items), using the hash object; but if you see the requisite code, you'll sure appreciate the simplicity of SQL:
data _null_ ;
if _n_ = 1 then do ;
if 0 then set region city ;
array it item: ;
dcl hash h (ordered:"a") ;
h.definekey ("region_code") ;
h.definedata ("region", "region_code") ;
do over it ;
h.definedata (vname (it)) ;
end ;
h.definedone() ;
do until (lr) ;
set region end = lr ;
h.add() ;
end ;
end ;
set city (rename=(item1-item4=_it1-_it4)) end = lc ;
array _it _it: ;
region_code = put (city_code, $6.) ;
if h.find() ne 0 then call missing (of item:) ;
do over it ;
it + _it ;
end ;
h.replace() ;
if lc then h.output (dataset:"want") ;
run ;
OTOH, instead of listing all the sum(item1) as item1, ... in the SQL query, you can auto-construct a macro variable containing the necessary text:
data _null_ ;
length sumit $ 32767 ;
do i = 1 to 4 ;
sumit = catx (",", sumit, cats ("sum(item", i, ") as item", i)) ;
end ;
call symputx ("sumit", sumit) ;
run ;
proc sql ;
create table want as
select region
, region_code
, &sumit
from city, region
where region_code = put (city_code, $6.)
group 1, 2
order 2
;
quit ;
Kind regards
Paul D.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.