Newuser999
Calcite | Level 5

Hello, 

I am a new user running into issues trying to perform left joins on large data sets. The overview of the situation is as follows. I have three data sets (there will be more eventually, but the later joins should be easier); call them:

Data1: Has millions of rows, but I can easily subset it to just the relevant observations; even then it will still have around 6 million observations. Has a unique ID to join on.

Data2: Has millions of rows and has a variable (NEED) that I want to join to Data1. Does not have a unique ID; instead it has ID2, which relates to ID.

Data3: Has millions of rows and has ID2 and ID. There may be multiple ID2s for each ID. 

Right now my plan is to subset Data1 so that the table is as small as possible, then do two left joins: first to relate ID to ID2, and finally to relate NEED to ID. Once I have that, I was going to run a PROC SUMMARY to get just the max NEED for each unique ID. My code currently looks like:

 

data data0;
   set data1 (keep = id {other vars needed});
   where {conditions};
run;

proc sql;
   create table joined_data as
   select A.id, A.{other vars}, B.id2, C.need
   from data0 as A
        left join data2 as B on A.id = B.id
        left join data3 as C on B.id2 = C.id2;
quit;

So basically, I want to relate ID2 to ID and then NEED to ID. Once I have that, I run PROC SUMMARY with ID as the CLASS variable and MAX of NEED, to end up with a single row per ID carrying the largest NEED.
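For clarity, the summary step I have in mind would look something like this (a sketch, assuming the joined table is named joined_data as above and the output data set name is just a placeholder):

```sas
/* Collapse to one row per ID, keeping the largest NEED */
proc summary data=joined_data nway;
   class id;
   var need;
   output out=max_need (drop=_type_ _freq_) max=max_need;
run;
```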

 

My problem is that when I tested with 1,000 observations in data0 and no limit on the joins, the 1,000 observations turned into 31,000 after the joins, and the query took 15 minutes to complete. There really are that many ID2s relating to each ID. I'm afraid that opening this up to the full ~6 million IDs could make it unfeasible to process.

 

I've looked into indexes, but they are not allowed on the raw data sets. I would need to copy the data to a WORK library, and just copying data2 or data3 is a very lengthy process because of their size. I'm not sure the indexes would save enough time to justify the time needed to copy the sets and create the indexes.
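For reference, the index approach I considered would look something like this (a sketch; the copy keeps only the join columns to cut down the copy time):

```sas
proc sql;
   /* Copy only the columns needed for the join */
   create table work.data2_copy as
   select id, id2 from data2;

   /* Index the join key on the WORK copy */
   create index id on work.data2_copy (id);
quit;
```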

 

I'm wondering if I'm going about this wrong and whether there are much easier ways to accomplish what I want. Thanks in advance for any suggestions you all have!

 

3 REPLIES
ballardw
Super User

You may need to consider a bit more closely how you want to join the values. Each duplicate value creates a match.

So if you have 3 rows with one value of x in one set and 2 rows with that value in the other and do a left join, you end up with 6 (3*2) records for that value.

Example:

data one;
   input x y;
datalines;
1 1
1 2
1 3
2 1
2 2
;
data two; 
   input x z;
datalines;
1 11
1 22
2 11
2 22
;

proc sql;
   select a.* , b.z
   from one as a 
        left join
        two as b
        on a.x=b.x
   ;
quit;

You may want to consider whether some other variables might be suitable to reduce the duplicates, such as a date, geographic location, product code or similar that appears in pairs of the data sets.

 

Something else to consider is to split the query into two pieces. Do the join for A and B and let that complete.

You may want to consider whether you need a DISTINCT to reduce duplicate records in that query. The result would then be used to join with C in a separate query, again keeping only what you need.

 

Subsetting data as early as practical may help by reducing the number of variables. If you only need the value of ID2 from B, then perhaps

 

from data0 as A left join (select distinct id, id2 from data2) as B

 

which will reduce the number of duplicate ID2 values that come in from B, as well as reducing the number of things pushed around in the background.

Similar for C: just get the values you need, and SELECT DISTINCT to reduce duplication.
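Putting those pieces together, the two-step version might look like this (an untested sketch, using the table and variable names from your query):

```sas
proc sql;
   /* Step 1: relate ID to ID2, deduplicating the map */
   create table step1 as
   select A.id, B.id2
   from data0 as A
        left join (select distinct id, id2 from data2) as B
        on A.id = B.id;

   /* Step 2: bring in NEED, again deduplicated */
   create table step2 as
   select A.*, C.need
   from step1 as A
        left join (select distinct id2, need from data3) as C
        on A.id2 = C.id2;
quit;
```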

Note: DISTINCT is itself a time-consuming operation, but if you don't expect 31,000 records it may get you down to something closer to 3,000.

 

Sort of a generic answer as I don't have any data to work with.

Basically if you need all the results you're going to eat the time and disk space. If you don't, then you need to figure out one or more ways to reduce the records at each step.

 

Ksharp
Super User
Try a hash table. And make the character variables' lengths as short as possible, especially for the table loaded into the hash table.
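A minimal sketch of the hash lookup, assuming the ID2/NEED pairs from data3 fit in memory and that you already have a table of (ID, ID2) pairs (called pairs here; all names follow the original query):

```sas
data joined;
   if 0 then set data3 (keep=id2 need);  /* define NEED in the PDV */
   if _n_ = 1 then do;
      declare hash h(dataset:"data3 (keep=id2 need)");
      h.defineKey("id2");
      h.defineData("need");
      h.defineDone();
   end;
   set pairs;                   /* one row per (id, id2) */
   if h.find() = 0 then output; /* keep rows with a matching NEED */
run;
```

If data3 can have multiple NEED rows per ID2, the hash would need the MULTIDATA:"YES" option and a find/find_next loop instead of a single find().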
Kurt_Bremser
Super User

Do the summary on table3 first, by ID and ID2.

Then join. As long as you have Cartesian joins throughout, they will kill you.
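One way to do that pre-summarization (a sketch, assuming data3 carries the ID/ID2 pairs as described in the original post): collapse it to one row per combination before joining, so the fan-out disappears.

```sas
/* One row per (ID, ID2) pair before any join */
proc summary data=data3 nway;
   class id id2;
   output out=data3_dedup (drop=_type_ _freq_);
run;
```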

If you have a DBMS as a backend, and loading data from there into WORK takes a long time, consider unloading the process to the DB with explicit pass-through.
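The pass-through shape would be something like this (a sketch only; the engine, connection options, and table name are all placeholders for whatever your DBMS actually is):

```sas
proc sql;
   connect to odbc as db (dsn="mydb");  /* hypothetical connection */
   create table want as
   select * from connection to db (
      /* this inner query runs in the database, in its SQL dialect */
      select id, max(need) as max_need
      from big_table
      group by id
   );
   disconnect from db;
quit;
```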


Discussion stats
  • 3 replies
  • 840 views
  • 0 likes
  • 4 in conversation