This is definately not an easy task.. so total kudos if anyone that can answer this.
My task is to merge 2 tables into a main 1 (so 3 all together). One of the 2 has data that is aggregated by city (so population by city) and the 2nd is aggregated by regions not in a city in table 1 (so out of city population).... The 3rd main one has data not aggregated, it's at the Postal Code level, but attached to each row is a City Name and an Out-of-City name. So far so good. When I merge all 3, the postal Codes that belong in the City (eg. Toronto) will get Toronto's total population, and that would be double/triple/etc. counting, considering Toronto has many many postal codes. So my task is to equally divide Toronto's population into each of the many many Postal Codes residing under Toronto in List 3. All this by different dates as well..
Confused ?, let me try visually.. THANK YOU..
Table 1
CITY Date Pop
Toronto 2011.01 5,795,389
Toronto 2011.02 5,803,064
Toronto 2011.03 5,810,739
Table2 (Note, Ontario is non-Toronto)
CITY Date Pop
Ontario 2011.01 2,678,462
Ontario 2011.02 2,678,754
Table3
POSTALCODE City Date
M1P 3F7 Toronto 2011.01
M1P 3F7 Toronto 2011.02
M1J 2G8 Toronto 2011.01
M1J 2G8 Toronto 2011.02
M1J 2G8 Toronto 2011.03
L0A 2F6 Ontario 2011.01
L0A 2F6 Ontario 2011.02
L2G 7D3 Ontario 2011.01
L2G 7D3 Ontario 2011.02
WANT:
POSTALCODE City Date Population
M1P 3F7 Toronto 2011.01 1,931,796 <- 5,795,389 divided by the 3 different Postal Codes in Toronto within the same date
M1P 3F7 Toronto 2011.02 2,901,532 <- 5,803,064 divided by the 2 different Postal Codes in Toronto within the same date
M1J 2G8 Toronto 2011.01 1,931,796 <- 5,795,389 divided by the 3 different Postal Codes in Toronto within the same date
M1J 2G8 Toronto 2011.02 2,901,532 <- 5,803,064 divided by the 2 different Postal Codes in Toronto within the same date
M1J 4K9 Toronto 2011.01 1,931,796 <- 5,795,389 divided by the 3 different Postal Codes in Toronto within the same date
L0A 2F6 Ontario 2011.01 1,339,231 <- 2,678,462 divided by the 2 different Postal Codes in Ontario within the same date
L0A 2F6 Ontario 2011.02 1,339,377 <- 2,678,754 divided by the 2 different Postal Codes in Ontario within the same date
L2G 7D3 Ontario 2011.01 1,339,231 <- 2,678,462 divided by the 2 different Postal Codes in Ontario within the same date
L2G 7D3 Ontario 2011.02 1,339,377 <- 2,678,754 divided by the 2 different Postal Codes in Ontario within the same date
Can't figure out the order of your output, but the contents seem to match:
data h1;
input (CITY Date) (:$10.) Pop comma20.;
cards;
Toronto 2011.01 5,795,389
Toronto 2011.02 5,803,064
Toronto 2011.03 5,810,739
;
data h2;
input (CITY Date) (:$10.) Pop comma20.;
cards;
Ontario 2011.01 2,678,462
Ontario 2011.02 2,678,754
;
data h3;
input (POSTALCODE City Date ) (:$&10.) ;
cards;
M1P 3F7 Toronto 2011.01
M1P 3F7 Toronto 2011.02
M1J 2G8 Toronto 2011.01
M1J 2G8 Toronto 2011.02
M1J 2G8 Toronto 2011.03
M1J 4K9 Toronto 2011.01
L0A 2F6 Ontario 2011.01
L0A 2F6 Ontario 2011.02
L2G 7D3 Ontario 2011.01
L2G 7D3 Ontario 2011.02
;
proc sql;
create table want as
select a.*, coalesce(b.pop,c.pop)/count(*) as pop format=10.0 from h3 a
left join
h1 b
on cats(a.city,a.date)=cats(b.city,b.date)
left join
h2 c
on cats(a.city,a.date)=cats(c.city,c.date)
group by a.city,a.date;
quit;
proc print;run;
Haikuo
As a possible data step alternative, here is a Hash approach:
data want (drop=_:);
if _n_=1 then do;
if 0 then do; set h1;set h2; set h3; end;
dcl hash h1(dataset:'h1'); h1.definekey('city','date'); h1.definedata('pop'); h1.definedone();
dcl hash h2(dataset: 'h2');h2.definekey('city','date'); h2.definedata('pop'); h2.definedone();
dcl hash h3(dataset: 'h3', multidata:'y');h3.definekey('city','date'); h3.definedata(all:'y'); h3.definedone();
dcl hiter h3i('h3');
end;
_rc=h3i.first();
do while (_rc=0);
_ct=0;
do _rc=h3.find() by 0 while (_rc=0); _ct+1; _rc=h3.find_next(); end;
_rc=h1.find();
if _rc ne 0 then _rc=h2.find();
pop=pop/_ct;
output;
_rc=h3i.next();
end;
run;
Haikuo
Thank you.. I'll give it a try..
How about:
data h1; input (CITY _Date) (:$10.) Pop comma20.; date=input(cats(_date,'.01'),yymmdd10.); drop _date; format date yymmd7.; cards; Toronto 2011.01 5,795,389 Toronto 2011.02 5,803,064 Toronto 2011.03 5,810,739 ; run; data h2; input (CITY _Date) (:$10.) Pop comma20.; date=input(cats(_date,'.01'),yymmdd10.); drop _date; format date yymmd7.; cards; Ontario 2011.01 2,678,462 Ontario 2011.02 2,678,754 ; run; data h3; input (POSTALCODE City _Date ) (:$&10.) ; date=input(cats(_date,'.01'),yymmdd10.); drop _date; format date yymmd7.; cards; M1P 3F7 Toronto 2011.01 M1P 3F7 Toronto 2011.02 M1J 2G8 Toronto 2011.01 M1J 2G8 Toronto 2011.02 M1J 2G8 Toronto 2011.03 M1J 4K9 Toronto 2011.01 L0A 2F6 Ontario 2011.01 L0A 2F6 Ontario 2011.02 L2G 7D3 Ontario 2011.01 L2G 7D3 Ontario 2011.02 ; run; proc sql ; create table want as select d.*,Population format=comma20. from h3 as d, (select a.city,a.date,a.pop/level as Population from (select * from h1 union all select * from h2) as a , (select city,date,count(distinct postalcode) as level from h3 group by city,date) as b where a.city=b.city and a.date=b.date ) as c where d.city=c.city and d.date=c.date order by d.POSTALCODE,d.City,d.Date ; quit;
Ksharp
Thanks KSharp..this is perfect..
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.