Re: merge proportionally (split into equal weights)

podarum · Posted 04-19-2012 10:18 AM

This is definately not an easy task.. so total kudos if anyone that can answer this.

My task is to merge 2 tables into a main 1 (so 3 all together). One of the 2 has data that is aggregated by city (so population by city) and the 2nd is aggregated by regions not in a city in table 1 (so out of city population).... The 3rd main one has data not aggregated, it's at the Postal Code level, but attached to each row is a City Name and an Out-of-City name. So far so good. When I merge all 3, the postal Codes that belong in the City (eg. Toronto) will get Toronto's total population, and that would be double/triple/etc. counting, considering Toronto has many many postal codes. So my task is to equally divide Toronto's population into each of the many many Postal Codes residing under Toronto in List 3. All this by different dates as well..

Confused ?, let me try visually.. THANK YOU..

Table 1

CITY Date Pop

Toronto 2011.01 5,795,389

Toronto 2011.02 5,803,064

Toronto 2011.03 5,810,739

Table2 (Note, Ontario is non-Toronto)

CITY Date Pop

Ontario 2011.01 2,678,462

Ontario 2011.02 2,678,754

Table3

POSTALCODE City Date

M1P 3F7 Toronto 2011.01

M1P 3F7 Toronto 2011.02

M1J 2G8 Toronto 2011.01

M1J 2G8 Toronto 2011.02

M1J 2G8 Toronto 2011.03

L0A 2F6 Ontario 2011.01

L0A 2F6 Ontario 2011.02

L2G 7D3 Ontario 2011.01

L2G 7D3 Ontario 2011.02

WANT:

POSTALCODE City Date Population

M1P 3F7 Toronto 2011.01 1,931,796 <- 5,795,389 divided by the 3 different Postal Codes in Toronto within the same date

M1P 3F7 Toronto 2011.02 2,901,532 <- 5,803,064 divided by the 2 different Postal Codes in Toronto within the same date

M1J 2G8 Toronto 2011.01 1,931,796 <- 5,795,389 divided by the 3 different Postal Codes in Toronto within the same date

M1J 2G8 Toronto 2011.02 2,901,532 <- 5,803,064 divided by the 2 different Postal Codes in Toronto within the same date

M1J 4K9 Toronto 2011.01 1,931,796 <- 5,795,389 divided by the 3 different Postal Codes in Toronto within the same date

L0A 2F6 Ontario 2011.01 1,339,231 <- 2,678,462 divided by the 2 different Postal Codes in Ontario within the same date

L0A 2F6 Ontario 2011.02 1,339,377 <- 2,678,754 divided by the 2 different Postal Codes in Ontario within the same date

L2G 7D3 Ontario 2011.01 1,339,231 <- 2,678,462 divided by the 2 different Postal Codes in Ontario within the same date

L2G 7D3 Ontario 2011.02 1,339,377 <- 2,678,754 divided by the 2 different Postal Codes in Ontario within the same date

Haikuo · Posted 04-19-2012 11:08 AM

Can't figure out the order of your output, but the contents seem to match:

data h1;

input (CITY Date) (:$10.) Pop comma20.;

cards;

Toronto 2011.01 5,795,389

Toronto 2011.02 5,803,064

Toronto 2011.03 5,810,739

;

data h2;

input (CITY Date) (:$10.) Pop comma20.;

cards;

Ontario 2011.01 2,678,462

Ontario 2011.02 2,678,754

;

data h3;

input (POSTALCODE City Date ) (:$&10.) ;

cards;

M1P 3F7 Toronto 2011.01

M1P 3F7 Toronto 2011.02

M1J 2G8 Toronto 2011.01

M1J 2G8 Toronto 2011.02

M1J 2G8 Toronto 2011.03

M1J 4K9 Toronto 2011.01

L0A 2F6 Ontario 2011.01

L0A 2F6 Ontario 2011.02

L2G 7D3 Ontario 2011.01

L2G 7D3 Ontario 2011.02

;

proc sql;

create table want as

select a.*, coalesce(b.pop,c.pop)/count(*) as pop format=10.0 from h3 a

left join

h1 b

on cats(a.city,a.date)=cats(b.city,b.date)

left join

h2 c

on cats(a.city,a.date)=cats(c.city,c.date)

group by a.city,a.date;

quit;

proc print;run;

Haikuo

Haikuo · Posted 04-19-2012 03:02 PM

As a possible data step alternative, here is a Hash approach:

data want (drop=_:);

if _n_=1 then do;

if 0 then do; set h1;set h2; set h3; end;

dcl hash h1(dataset:'h1'); h1.definekey('city','date'); h1.definedata('pop'); h1.definedone();

dcl hash h2(dataset: 'h2');h2.definekey('city','date'); h2.definedata('pop'); h2.definedone();

dcl hash h3(dataset: 'h3', multidata:'y');h3.definekey('city','date'); h3.definedata(all:'y'); h3.definedone();

dcl hiter h3i('h3');

end;

_rc=h3i.first();

do while (_rc=0);

_ct=0;

do _rc=h3.find() by 0 while (_rc=0); _ct+1; _rc=h3.find_next(); end;

_rc=h1.find();

if _rc ne 0 then _rc=h2.find();

pop=pop/_ct;

output;

_rc=h3i.next();

end;

run;

Haikuo

podarum · Posted 04-19-2012 03:06 PM

Thank you.. I'll give it a try..

Ksharp · Posted 04-20-2012 12:55 AM

How about:

data h1;
input (CITY _Date) (:$10.) Pop comma20.;
date=input(cats(_date,'.01'),yymmdd10.);
drop _date;
format date yymmd7.;
cards;
Toronto 2011.01 5,795,389
Toronto 2011.02 5,803,064
Toronto 2011.03 5,810,739
;
run;

data h2;
input (CITY _Date) (:$10.) Pop comma20.;
date=input(cats(_date,'.01'),yymmdd10.);
drop _date;
format date yymmd7.;
cards;
Ontario 2011.01 2,678,462
Ontario 2011.02 2,678,754
;
run;
data h3;
input (POSTALCODE City _Date ) (:$&10.) ;
date=input(cats(_date,'.01'),yymmdd10.);
drop _date;
format date yymmd7.;
cards;
M1P 3F7    Toronto   2011.01
M1P 3F7    Toronto   2011.02
M1J 2G8    Toronto   2011.01
M1J 2G8    Toronto   2011.02
M1J 2G8    Toronto   2011.03
M1J 4K9    Toronto   2011.01
L0A 2F6    Ontario   2011.01
L0A 2F6    Ontario   2011.02
L2G 7D3    Ontario   2011.01
L2G 7D3    Ontario   2011.02
;
run;
proc sql ;
create table want as
select d.*,Population format=comma20.
from  h3 as d,
    (select a.city,a.date,a.pop/level as Population
       from (select * from h1 union all select * from h2) as a ,
            (select city,date,count(distinct postalcode) as level from h3  group by city,date) as b
        where a.city=b.city and a.date=b.date )  as c
 where d.city=c.city and d.date=c.date 
  order by d.POSTALCODE,d.City,d.Date ; 
quit;

Ksharp

podarum · Posted 04-20-2012 08:28 AM

Thanks KSharp..this is perfect..

merge proportionally (split into equal weights)