Hello, I am a new user and running into issues trying to perform left joins on a large data sets. The situation overview is as follows. I have three data sets (will be more eventually, but the later joins should be easier) call them: Data1: Has millions of rows but I can subset this easily to look at just the observations that are relevant will still have around 6 million observations. Has a unique ID to join on. Data2: Has millions of rows and has a variable (=need) that I want to join to data1. Does not have unique ID, instead has ID2 that relates to ID. Data3: Has millions of rows and has ID2 and ID. There may be multiple ID2s for each ID. Right now my plan is to subset data1 so that the table is as small as possible and then do two left joins to correlate ID to ID2 and finally correlate my need to id. Once I have that I was going to run a proc summary to get just the max need for each unique id. My code currently looks like: data data0;
set data1 (keep = id {other vars needed})
where {conditions};
run;
proc sql; create table joined_data as
select A.id, A.{other vars}, B.id2, C.need
from data0 as A left join data2 as B
on A.id=B.id
left join data3 as C
on B.id2=C.id2
quit;
So basically, I want to relate id2 to id and then my need to id. Once I have that, I run proc summary on class variable id with max need, to create a single id together with the largest need listed. My problem is that when I was testing on a obs of 1000 for data0 and then allowing max obs to do the joins, the 1000 observations turned into 31,000 after the joins and took 15 minutes to complete. There are that many ID2s relating to each ID. I'm afraid the opening up to the full ~6 million IDs could make this unfeasible to process. I've looked into indexes but they are not allowed on the raw data sets. I would need to copy to a work library and just copying data2 or data3 is a very lengthy process because of their size. I'm not sure the indexes would save enough time to justify the time to copy the sets and create indexes. I wondering if I'm going about this wrong and there are much easier ways to accomplish what I want? Thanks in advance to any suggestions you all have!
... View more