06-21-2017 03:11 PM
I have come across a situation where I need to merge two transactional datasets. Dataset A has about 500,000 observations with ACCT_IND as the primary key; dataset B has about 750 million observations. I need to merge them on ACCT_IND. A DATA step merge would require sorting both datasets first, and a PROC SQL join on data this size would be very expensive. Can someone suggest the best way to do it?
Can someone demonstrate a better method to do this?
06-21-2017 04:10 PM
Can you explain in more detail what you need?
What type of merge do you need? For example, are you using the ACCT_IND from the second dataset to look up attributes of that account from the first dataset? Is it a one-to-one match, or one-to-many?
Is this just for a report that you need to run or are you creating and/or updating a new table based on the merge?
What do you mean by "transactional" datasets? Do you mean that they change frequently? Do you need to re-run the whole dataset, or can you just operate on the records that are changing? Or do you mean that the second dataset represents modifications (or transactions) to be applied to the first dataset?
Where are these datasets actually managed? Are they only stored in SAS datasets, or do you have some type of database or data warehouse where the data really lives?
06-21-2017 05:45 PM
I am assuming that the smaller 500,000-observation dataset is unique at the ACCT_IND level.
The methodology is to split the larger file into 10 smaller datasets and merge the smaller dataset above with each piece. To test the code, I added names to the 'small' dataset and transaction values to the 'large' dataset.
data small;
  input ACCT_IND $ name $;
  datalines;
111 jon
123 bob
124 mary
451 lisa
333 sue
;
run;

data large;
  input ACCT_IND $ sales;
  datalines;
111 10
111 15
111 13
123 18
123 16
123 17
124 19
124 22
451 11
451 8
451 17
451 15
333 14
333 12
333 26
333 22
333 25
;
run;

*Number of records in the large dataset;
%let records = 17; *750000000 for the real 750 million record dataset;

*Split the large dataset prior to merging to gain speed when merging;
data huge_dataset_0 huge_dataset_1 huge_dataset_2 huge_dataset_3 huge_dataset_4
     huge_dataset_5 huge_dataset_6 huge_dataset_7 huge_dataset_8 huge_dataset_9;
  set large;
  if _n_ <= &records./10 then output huge_dataset_0;
  else if _n_ <= 2/10*&records. then output huge_dataset_1;
  else if _n_ <= 3/10*&records. then output huge_dataset_2;
  else if _n_ <= 4/10*&records. then output huge_dataset_3;
  else if _n_ <= 5/10*&records. then output huge_dataset_4;
  else if _n_ <= 6/10*&records. then output huge_dataset_5;
  else if _n_ <= 7/10*&records. then output huge_dataset_6;
  else if _n_ <= 8/10*&records. then output huge_dataset_7;
  else if _n_ <= 9/10*&records. then output huge_dataset_8;
  else output huge_dataset_9;
run;

proc sort data = small;
  by ACCT_IND;
run;

*Sort each piece of the large dataset and merge the smaller dataset with it;
%macro sort(i);
  proc sort data = huge_dataset_&i.;
    by ACCT_IND;
  run;

  data large_&i.;
    merge small (in = a) huge_dataset_&i. (in = b);
    by ACCT_IND;
    if b then output;
  run;
%mend;

%sort(i=0); %sort(i=1); %sort(i=2); %sort(i=3); %sort(i=4);
%sort(i=5); %sort(i=6); %sort(i=7); %sort(i=8); %sort(i=9);

*Bring the pieces back together;
data huge_dataset;
  set large_0 large_1 large_2 large_3 large_4
      large_5 large_6 large_7 large_8 large_9;
run;
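The ten %sort calls above can also be driven by a macro %do loop rather than written out by hand. A minimal sketch, assuming the same dataset names as above (%merge_pieces is a hypothetical wrapper, not part of the original code):

```sas
/* Sketch: loop over the pieces instead of calling %sort ten times.
   Assumes SMALL is already sorted and huge_dataset_0..9 exist. */
%macro merge_pieces(pieces=10);
  %local i;
  %do i = 0 %to %eval(&pieces. - 1);
    proc sort data = huge_dataset_&i.;
      by ACCT_IND;
    run;

    data large_&i.;
      merge small (in = a) huge_dataset_&i. (in = b);
      by ACCT_IND;
      if b then output; /* keep every record of the large piece */
    run;
  %end;
%mend merge_pieces;

%merge_pieces()
```

This behaves the same as the ten explicit calls; it only removes the repetition.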
06-21-2017 06:15 PM
As @Tom said, we need to know more, especially the type of join.
How many records do you expect in the output? 500k? 750m?
Is the key in A unique?
These are SAS data sets?
How many variables in each input dataset? How many variables do you want out?
None of the data sets is sorted?
06-22-2017 01:50 PM
See response below.
How many records do you expect in the output? --- 500 k
Is the key in A unique? - The key (ACCT_IND) in A is unique, and it has different transactions tied to it in B.
These are SAS data sets? - Yes
How many variables in each input dataset? How many variables do you want out? - Close to 10 variables in A, 950 in B. I want all from A and 3-4 from B
None of the data sets is sorted? - Nope.
Any indexes? - Nope
06-22-2017 06:45 PM - edited 06-22-2017 07:05 PM
This merge ran in 5 min on my machine.
data A(keep=KEY A1-A10) B(keep=KEY B1-B10);
  length A1-A10 B1-B10 8;
  do I = 1 to 750e6;
    KEY = ranuni(0);
    /* roughly 500,000 of the 750 million keys also land in A */
    if KEY < 500/750000 then output A;
    output B;
  end;
run;

data WANT;
  if _N_ = 1 then do;
    /* load the small table A into an in-memory hash, keyed on KEY */
    dcl hash A(dataset:'A');
    A.definekey('KEY');
    A.definedata('A1','A2','A3','A4','A5','A6','A7','A8','A9','A10');
    A.definedone();
    if 0 then set A; /* compile-time SET: creates host variables for the hash data */
  end;
  set B(keep=KEY B1 B2);
  RC = A.find();
  if RC = 0; /* keep only records whose KEY is found in A */
run;
NOTE: There were 500029 observations read from the data set WORK.A.
NOTE: There were 750000000 observations read from the data set WORK.B.
NOTE: The data set WORK.WANT has 500029 observations and 14 variables.
NOTE: DATA statement used (Total process time):
real time 5:48.32
user cpu time 5:10.08
system cpu time 37.69 seconds
OS Memory 98908.00k
Timestamp 23/06/2017 10:57:57 AM
Step Count 4411 Switch Count 665
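The hash lookup above works because only the small table has to fit in memory (about 500,000 rows of 11 numeric variables, on the order of tens of megabytes), so the 750-million-row table is read just once, unsorted. For comparison, a rough sketch of the same inner join in PROC SQL, using the A/B datasets generated above; SQL will typically build its own in-memory hash for the small table here, but offers less explicit control:

```sas
/* Sketch of the equivalent inner join in PROC SQL.
   keep= limits B to the needed columns before the join. */
proc sql;
  create table WANT_SQL as
  select a.*, b.B1, b.B2
  from B (keep=KEY B1 B2) as b
       inner join A as a
         on b.KEY = a.KEY;
quit;
```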