This does not solve your problem but I
am having trouble coming with example data.
I was able to join a 160,000,000 11 variable table
with a 6,000,000 2 variable table in under 2 minutes.
I have a very slow old 2008 computer with DDR2 ram.
Newer computers should cut the time in half.
I was unable to create a 14m result?
Do not see a need to parallelize the code(yet)
/* 60k result is a far cry from 14m but I don't see how to get 14m
My resulting table
Result of join of 160,000,000 and 6,000,000
NOTE: Table SD1.RES created, with 60000 rows and 10 columns.
261 ! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 2:03.27
cpu time 3:19.24
*/
Maybe you only need one variable in the potential 1:1 relationships?
Here are my assumed cardinalities
3 Years
4 Countries
10,000 erg_ids ( algorithm is relatively insensitive to this variable)
1,000 different offices (assume office_name is 1:1 with office)
1,000 different office names (assume office_name is 1:1 with office)
100 different cars 1:1 car names
100 different car names
400 redidues 1:1 with residue name
400 residue name
INPUTS
Up to 40 obs from spde.dmperg160m total obs=160,000,000
OFFICE_
STA_ID OFFICE COUNTRY NAME CAR CAR_NAME RESIDUE RESIDUENAME ERG_ID YEAR
1 A049; C01 D049; E049 F049 G04C H04C 2 2005
1 A0324 C02 D0324 E032 F03F G03< H03< 61 2004
1 A0348 C02 D0348 E034 F03> G03> H03> 81 2003
1 A0;32 C04 D0;32 E0;3 F0;3 G0;= H0;= 49 2004
1 A0252 C01 D0252 E025 F025 G025 H025 80 2005
1 A035: C02 D035: E035 F03S G03? H03? 21 2003
1 A053: C02 D053: E053 F05= G05= H05= 83 2005
1 A069: C04 D069: E069 F069 G0@9 H0@9 54 2003
.....
Up to 40 obs from spde.dmpsta17m total obs=6,000,000
CARTYPE_
Obs STA_ID ID
1 1 33
2 2 44
3 3 55
4 4 66
5 5 77
6 6 88
7 7 99
8 8 ::
9 9 ;;
10 10 <<
WANT ( Cannot get 14m which may imply exact duplicate records)
Up to 40 obs SD1.RES total obs=60,000
OFFICE_ CARTYPE_
Obs OFFICE COUNTRY NAME CAR CAR_NAME RESIDUE RESIDUENAME STA_IDS ERG_IDS IDS
1 A0222 C01 D0222 E022 F022 G022 H022 280 2 43
2 A0222 C01 D0222 E022 F022 G02< H02< 282 2 42
3 A0222 C01 D0222 E022 F022 G0<2 H0<2 152 1 38
4 A0222 C01 D0222 E022 F022 G0<< H0<< 170 1 37
5 A0222 C01 D0222 E022 F02< G022 H022 318 2 42
6 A0222 C01 D0222 E022 F02< G02< H02< 306 2 46
7 A0222 C01 D0222 E022 F02< G0<2 H0<2 159 1 39
8 A0222 C01 D0222 E022 F02< G0<< H0<< 140 1 31
9 A0222 C01 D0222 E022 F02F G022 H022 275 2 45
10 A0222 C01 D0222 E022 F02F G02< H02< 274 2 42
11 A0222 C01 D0222 E022 F02F G0<2 H0<2 156 1 35
12 A0222 C01 D0222 E022 F02F G0<< H0<< 145 1 35
13 A0222 C01 D0222 E022 F02P G022 H022 303 2 43
14 A0222 C01 D0222 E022 F02P G02< H02< 301 2 46
15 A0222 C01 D0222 E022 F02P G0<2 H0<2 146 1 36
16 A0222 C01 D0222 E022 F02P G0<< H0<< 173 1 40
WORKING CODE
============
group
by r.office
,r.country
,r.office_name
,r.car
,r.car_name
,r.residue
,r.residuename
FULL SOLUTION
=============
libname spde spde
('c:\wrk\spde_c','d:\wrk\spde_d','e:\wrk\spde_e','g:\wrk\spde_g','h:\wrk\spde_h')
metapath =('c:\wrk\spde_c\metadata')
indexpath=(
'c:\wrk\spde_c'
,'d:\wrk\spde_d'
,'e:\wrk\spde_e'
,'g:\wrk\spde_g'
,'h:\wrk\spde_h')
datapath =(
'c:\wrk\spde_c'
,'d:\wrk\spde_d'
,'e:\wrk\spde_e'
,'g:\wrk\spde_g'
,'h:\wrk\spde_h')
partsize=500m
;
proc datasets lib=spde kill;
run;quit;
* CREATE INPUT;
data spde.dmperg160m (drop=ran: rec sortedby=country index=(country));
retain sta_id 0;
length
office $5
country $3
office_name $5
car $4
car_name $4
residue $4
residuename $4
erg_id $4;
do rec=1 to 40000000;
do country='C01','C02','C02','C04';
if mod(rec,10) = 0 then sta_id=sta_id+1;
ran=int(100*uniform(5731));
ran1=int(100*uniform(5731));
ran2=int(100*uniform(5731));
year=put(2003+mod(ran,3),4.);
office = cats('A0',byte(mod(ran,10)+50),byte(mod(ran1,10)+50),byte(mod(ran2,10)+50));
office_name = cats('D0',byte(mod(ran,10)+50),byte(mod(ran1,10)+50),byte(mod(ran2,10)+50));
car = cats('E0',byte(mod(ran,10)+50),byte(mod(ran1,10)+50));
car_name = cats('F0',byte(mod(ran,10)+50),byte(mod(ran1,50)+50));
residue = cats('G0',byte(mod(ran,20)+50),byte(mod(ran1,20)+50));
residuename = cats('H0',byte(mod(ran,20)+50),byte(mod(ran1,20)+50));
erg_id = put(mod(ran,9999),4.);
output;
end;
end;
run;quit;
/*
NOTE: The data set SPDE.DMPERG160M has 160000000 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 4:47.34
user cpu time 7:01.88
system cpu time 4:11.58
memory 63209.31k
OS Memory 89692.00k
Timestamp 12/09/2016 10:24:40 AM
Step Count 145 Switch Count 9766
*/
data spde.dmpsta17m(index=(sta_id));
retain sta_id 0;
length cartype_id $2;
do sta_id=1 to 6000000;
cartype_id = cats(byte(mod(sta_id,50)+50),byte(mod(sta_id,50)+50));
output;
end;
run;quit;
/*
NOTE: The data set SPDE.DMPSTA17M has 6000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 3.54 seconds
cpu time 5.67 seconds
1280! quit;
*/
proc sql;
create
table sd1.res as
select
r.office
,r.country
,r.office_name
,r.car
,r.car_name
,r.residue
,r.residuename
,count(distinct l.sta_id) as sta_ids
,count (distinct r.erg_id) as erg_ids
,count(distinct l.cartype_ID) as cartype_IDs
from
spde.dmpsta17m as l
,spde.dmperg160m as r
where
r.year='2005'
and l.sta_id = r.sta_id
group
by r.office
,r.country
,r.office_name
,r.car
,r.car_name
,r.residue
,r.residuename
;quit;
NOTE: Table SD1.RES created, with 60000 rows and 10 columns.
261 ! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 2:03.27
cpu time 3:19.24
... View more