Solved: DATA MERGE on large datasets but SAS runs out of memory - Page 4

novinosrin · Posted 04-24-2019 09:30 AM

"My SAS is still running "proc sort nodupkey; by id codes date;"

Hmm that's worrying. Not knowing your data and your infrastructure is making it hard. Anyways, for the hash code to work, both the temp and small should have the same IDs. Please make sure you are mindful about that. Have a good day. Off for coffee!

Cruise · Posted 04-24-2019 09:34 AM

Enjoy your coffee. Thank for using the word "worrying". It's interesting my worry goes down when someone shows an understanding.

Cruise · Posted 04-24-2019 11:15 AM

@novinosrin

I'm using proc sql-inner join to start with. H folder has 1.88TB free space while Z folder has only 465GB. I'm using H to output a resulting table from SQL. Hope this helps. But @Astounding pointed out that I needed more space in H. Let's see what happens.

proc sql;
create table h.want as
select a.id,codes,date
from z.small a inner join z.large b
on a.id=b.id
order by id,codes ,date;
quit;

novinosrin · Posted 04-24-2019 11:26 AM

Good. The people who have offered advice are highly experienced to the tune of my 5+ years*N number of their years. So I truly trust them. My fun here is pretty much to do with solving with coding akin to a video game that i treat this like Xbox or PlayStation. So from the point of infrastructure, administration, capacity planning, contingency planning and beyond, you are right that your best bet is to seek and receive advice from wise and senior people like @Astounding , @Kurt_Bremser , @Ksharp , @Patrick etc.

I shamelessly admit I am useless in that regard.

Kurt_Bremser · Posted 04-24-2019 11:44 AM

SQL joins like that are VERY intensive on the WORK library, so you might run out of space there. If WORK shares its physical location with library H, a crash is pre-programmed.

That's why I suggested a simple data step merge.

And if you have non-uniqueness in both input datasets, the SQL will create a cartesian join and cause a guaranteed crash.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Cruise · Posted 04-24-2019 11:53 AM

@Kurt_Bremser I thought too. but not that bad. SQL outputted N-Rows=108,956,917 and N_uniq_id=181,978. However, I have only selected three variables because I didn't believe this could output a table. But it did. Now I'm running the same SQL again with select *. There are green warning messages below. Is it safe to ignore these warnings?

WARNING: Column named ID is duplicated in a select expression (or a view).
Explicit references to it will be to the first one.

WARNING: Variable obligation already exists on file H.MERGED.

novinosrin · Posted 04-24-2019 11:58 AM

Show us your SQL code plz

Cruise · Posted 04-24-2019 12:02 PM

@novinosrin , @Kurt_Bremser

small has "id" and other 20 variables which will be used in the analysis

large has only "id", "codes" and "date"

proc sql;
create table H.merged as
select *
from r.small a inner join R.large b
on a.id=b.id
order by id,codes,date;
quit;

novinosrin · Posted 04-24-2019 12:04 PM

Believing your small as unique ID as discussed before .that is each obs has distinct ID

proc sql;
create table H.merged as
select a.*,codes,date
from r.small a inner join R.large b
on a.id=b.id
order by a.id,codes,date;
quit;

This assumed codes and date are not in small

Cruise · Posted 04-24-2019 12:19 PM

GREAT. No warning message now. I understand the logic. But what is the benefit of order by a.id? instead just id? is it to save time? also to avoid duplicate queries?

novinosrin · Posted 04-24-2019 12:27 PM

Well, the order of Proc SQL output is never guaranteed and so that pretty much warrants an order by clause to have your output ordered based on a group unless otherwise desired or if order doesn't matter.

I generally apply the order by clause whenever I use Proc SQL in the objective that this output may serve a purpose that would go as input for further processing where we might be using a datastep DOW/By Group processing etc. And In general, SAS works best when have you ordered long data as opposed to unordered wide data. This is thumb rule!

@Cruise But what is the benefit of order by a.id? instead just id?

That is to avoid ambiguous reference as Sql processor doesn't know which ID column to select as it exists in both tables

Kurt_Bremser · Posted 04-24-2019 12:08 PM

@Cruise wrote:

@Kurt_Bremser I thought too. but not that bad. SQL outputted N-Rows=108,956,917 and N_uniq_id=181,978. However, I have only selected three variables because I didn't believe this could output a table. But it did. Now I'm running the same SQL again with select *. There are green warning messages below. Is it safe to ignore these warnings?

WARNING: Column named ID is duplicated in a select expression (or a view).
Explicit references to it will be to the first one.

WARNING: Variable obligation already exists on file H.MERGED.

That's what happens when you use the asterisk. You need to be specific as to what columns from where should be included in the target dataset.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Cruise · Posted 04-24-2019 01:52 PM

Thanks everybody: Mission accomplished.

NOTE: Table H.MERGED created, with 108956917 rows and 108 columns.

31   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           36:31.60
      cpu time            4:17.32

Kurt_Bremser · Posted 04-24-2019 02:49 PM

36 minutes is quite OK for such sizes. It would be great if you posted your final successful code, as a reference for people who find this thread in a search.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Cruise · Posted 04-24-2019 02:51 PM

@Kurt_Bremser @novinosrin

for future reference. final success code is below which was written by novinosirin.

proc sql;
create table H.merged as
select a.*,codes,date
from r.small a inner join R.large b
on a.id=b.id
order by a.id,codes,date;
quit;

Re: DATA MERGE on large datasets but SAS runs out of memory

Re: DATA MERGE on large datasets but SAS runs out of memory

Re: DATA MERGE on large datasets but SAS runs out of memory

Re: DATA MERGE on large datasets but SAS runs out of memory

Re: DATA MERGE on large datasets but SAS runs out of memory

Re: DATA MERGE on large datasets but SAS runs out of memory

Re: DATA MERGE on large datasets but SAS runs out of memory

Re: DATA MERGE on large datasets but SAS runs out of memory

Re: DATA MERGE on large datasets but SAS runs out of memory

Re: DATA MERGE on large datasets but SAS runs out of memory

Re: DATA MERGE on large datasets but SAS runs out of memory

Re: DATA MERGE on large datasets but SAS runs out of memory

Re: DATA MERGE on large datasets but SAS runs out of memory

Re: DATA MERGE on large datasets but SAS runs out of memory

Re: DATA MERGE on large datasets but SAS runs out of memory

Classroom Training Available!