Hi all,
I have come across a situation where I need to merge two transactional datasets. Dataset A has about 500,000 observations, with ACCT_IND as its primary key. The second dataset, B, has about 750 million observations. I need to merge these on ACCT_IND. If I use a DATA step MERGE I would have to sort both datasets first, and a PROC SQL join on tables this size would be very expensive.
Can someone demonstrate a better method to do this?
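(For reference, the conventional sort-then-merge approach I am trying to avoid would look roughly like this sketch, with SMALL and BIG as stand-ins for my two datasets:)
proc sort data=small; by ACCT_IND; run;
proc sort data=big; by ACCT_IND; run; *sorting the 750M-row table is the expensive step;
data merged;
merge small(in=a) big(in=b);
by ACCT_IND;
if b;
run;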
Can you explain in more detail what you need?
What type of merge do you need? For example, are you using the ACCT_IND from the second dataset to look up attributes of that account in the first dataset? Is it a one-to-one match, or one-to-many?
Is this just for a report that you need to run or are you creating and/or updating a new table based on the merge?
What do you mean by "transactional" datasets? Do you mean that they change frequently? Do you need to re-run the whole dataset, or can you just operate on the records that are changing? Or do you mean that the second dataset represents modifications (or transactions) to be applied to the first dataset?
Where are these datasets actually managed? Are they only stored as SAS datasets, or do you have some type of database or data warehouse where the data really lives?
I am assuming that the smaller 500,000-observation dataset is unique at the ACCT_IND level.
The methodology is to split the large dataset into 10 smaller datasets and merge the small dataset with each piece. To test the code, I added names to the 'small' dataset and transaction values to the 'large' dataset.
data small;
input ACCT_IND $ name $;
datalines;
111 jon
123 bob
124 mary
451 lisa
333 sue
;
run;
data large;
input ACCT_IND $ sales;
datalines;
111 10
111 15
111 13
123 18
123 16
123 17
124 19
124 22
451 11
451 8
451 17
451 15
333 14
333 12
333 26
333 22
333 25
;
run;
*Number of records in the large dataset (17 for this test; 750000000 in practice);
%let records = 17;
*750 million record dataset - Split prior to merging to gain speed when merging;
data huge_dataset_0 huge_dataset_1 huge_dataset_2 huge_dataset_3 huge_dataset_4 huge_dataset_5 huge_dataset_6 huge_dataset_7 huge_dataset_8 huge_dataset_9;
set large;
if _n_ <= 1/10*&records. then output huge_dataset_0;
else if _n_ <= 2/10*&records. then output huge_dataset_1;
else if _n_ <= 3/10*&records. then output huge_dataset_2;
else if _n_ <= 4/10*&records. then output huge_dataset_3;
else if _n_ <= 5/10*&records. then output huge_dataset_4;
else if _n_ <= 6/10*&records. then output huge_dataset_5;
else if _n_ <= 7/10*&records. then output huge_dataset_6;
else if _n_ <= 8/10*&records. then output huge_dataset_7;
else if _n_ <= 9/10*&records. then output huge_dataset_8;
else output huge_dataset_9;
run;
proc sort data = small;
by ACCT_IND;
run;
*Sort each piece of the large dataset and merge the smaller dataset with it;
%macro sort(i);
proc sort data = huge_dataset_&i.;
by ACCT_IND;
run;
data large_&i.;
merge small ( in = a )
huge_dataset_&i. ( in = b );
by ACCT_IND;
if b then output;
run;
%mend;
%sort(i=0); %sort(i=1); %sort(i=2); %sort(i=3); %sort(i=4);
%sort(i=5); %sort(i=6); %sort(i=7); %sort(i=8); %sort(i=9);
*Bring the pieces back together;
data huge_dataset;
set large_0 large_1 large_2 large_3 large_4
large_5 large_6 large_7 large_8 large_9;
run;
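As a variation, the ten-way IF/ELSE split and the ten macro calls could be generated with a %DO loop, splitting on MOD(_N_, 10) so the record count never has to be hard-coded. The chunks come out interleaved rather than contiguous, but since each chunk is sorted and merged with the small dataset independently, the combined result is the same. A sketch (assumes SMALL has already been sorted by ACCT_IND as above):
%macro split_merge(pieces=10);
*Split the large dataset round-robin into equal chunks;
data %do i = 0 %to &pieces.-1; chunk_&i. %end;;
set large;
select (mod(_n_, &pieces.));
%do i = 0 %to &pieces.-1;
when (&i.) output chunk_&i.;
%end;
end;
run;
*Sort each chunk and merge the small dataset with it;
%do i = 0 %to &pieces.-1;
proc sort data=chunk_&i.; by ACCT_IND; run;
data merged_&i.;
merge small(in=a) chunk_&i.(in=b);
by ACCT_IND;
if b;
run;
%end;
*Concatenate the merged pieces;
data huge_dataset;
set %do i = 0 %to &pieces.-1; merged_&i. %end;;
run;
%mend split_merge;
%split_merge(pieces=10);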
As @Tom said, we need to know more, especially the type of join.
How many records do you expect in the output? 500k? 750m?
Is the key in A unique?
These are SAS data sets?
How many variables in each input dataset? How many variables do you want out?
None of the data sets is sorted?
Any indexes?
See response below.
How many records do you expect in the output? - 500K
Is the key in A unique? - The key (ACCT_IND) in A is unique, and it has different transactions tied to it in B.
These are SAS data sets? - Yes
How many variables in each input dataset? How many variables do you want out? - Close to 10 variables in A, 950 in B. I want all from A and 3-4 from B
None of the data sets is sorted? - Nope.
Any indexes? - Nope
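Given those answers (unique key in A, all of A's ~10 variables plus only 3-4 of B's 950 wanted), one way to express this is an inner join with a KEEP= dataset option so only the needed B columns are read. A minimal PROC SQL sketch, with T1-T3 as hypothetical stand-ins for the wanted B variables; how well it performs on 750 million rows will depend on available memory and disk:
proc sql;
create table want as
select a.*, b.t1, b.t2, b.t3
from A as a
inner join B(keep=ACCT_IND t1 t2 t3) as b
on a.ACCT_IND = b.ACCT_IND;
quit;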
This merge ran in 5 min on my machine.
*Simulate the problem: every row goes to B, and about 1 in 1,500 keys (~500K) also goes to A;
data A(keep=KEY A1-A10)
B(keep=KEY B1-B10);
length A1-A10 B1-B10 8;
do I = 1 to 750e6;
KEY = ranuni(0);
if KEY < 500/750000 then output A; *expected ~500,000 rows in A;
output B;
end;
run;
data WANT;
if _N_=1 then do;
*Load the small table A into an in-memory hash keyed on KEY;
dcl hash A(dataset:'A');
A.definekey('KEY');
A.definedata('A1','A2','A3','A4','A5','A6','A7','A8','A9','A10');
A.definedone();
if 0 then set A; *declares the host variables A1-A10 without reading A;
end;
set B(keep=KEY B1 B2); *one sequential pass over B - no sorting required;
RC=A.find(); *retrieves A's variables when KEY matches;
if RC=0; *keep only matching rows;
run;
NOTE: There were 500029 observations read from the data set WORK.A.
NOTE: There were 750000000 observations read from the data set WORK.B.
NOTE: The data set WORK.WANT has 500029 observations and 14 variables.
NOTE: DATA statement used (Total process time):
real time 5:48.32
user cpu time 5:10.08
system cpu time 37.69 seconds
memory 71702.09k
OS Memory 98908.00k
Timestamp 23/06/2017 10:57:57 AM
Step Count 4411 Switch Count 665
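Mapping this onto the original question: a minimal sketch, assuming the lookup table is named SMALL, the transaction table BIG, and TXN_AMT, TXN_DATE, TXN_TYPE are hypothetical stand-ins for the 3-4 wanted B variables. DEFINEDATA's ALL:'YES' argument brings in all of SMALL's ~10 variables, BIG is read in a single sequential pass with no sorting, and memory only has to hold the 500K-row hash table:
data want;
if _n_=1 then do;
dcl hash h(dataset:'small');
h.definekey('ACCT_IND');
h.definedata(all:'yes'); *all of SMALL's variables become hash data;
h.definedone();
if 0 then set small; *declares the host variables;
end;
set big(keep=ACCT_IND txn_amt txn_date txn_type); *hypothetical variable names;
if h.find()=0; *keep only transactions for accounts present in SMALL;
run;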