"My SAS is still running "proc sort nodupkey; by id codes date;"
Hmm that's worrying. Not knowing your data and your infrastructure is making it hard. Anyways, for the hash code to work, both the temp and small should have the same IDs. Please make sure you are mindful about that. Have a good day. Off for coffee!
I'm using proc sql-inner join to start with. H folder has 1.88TB free space while Z folder has only 465GB. I'm using H to output a resulting table from SQL. Hope this helps. But @Astounding pointed out that I needed more space in H. Let's see what happens.
proc sql;
create table h.want as
select a.id,codes,date
from z.small a inner join z.large b
on a.id=b.id
order by id,codes ,date;
quit;
Good. The people who have offered advice are highly experienced to the tune of my 5+ years*N number of their years. So I truly trust them. My fun here is pretty much to do with solving with coding akin to a video game that i treat this like Xbox or PlayStation. So from the point of infrastructure, administration, capacity planning, contingency planning and beyond, you are right that your best bet is to seek and receive advice from wise and senior people like @Astounding , @Kurt_Bremser , @Ksharp , @Patrick etc.
I shamelessly admit I am useless in that regard.
SQL joins like that are VERY intensive on the WORK library, so you might run out of space there. If WORK shares its physical location with library H, a crash is pre-programmed.
That's why I suggested a simple data step merge.
And if you have non-uniqueness in both input datasets, the SQL will create a cartesian join and cause a guaranteed crash.
@Kurt_Bremser I thought too. but not that bad. SQL outputted N-Rows=108,956,917 and N_uniq_id=181,978. However, I have only selected three variables because I didn't believe this could output a table. But it did. Now I'm running the same SQL again with select *. There are green warning messages below. Is it safe to ignore these warnings?
WARNING: Column named ID is duplicated in a select expression (or a view).
Explicit references to it will be to the first one.
WARNING: Variable obligation already exists on file H.MERGED.
Show us your SQL code plz
small has "id" and other 20 variables which will be used in the analysis
large has only "id", "codes" and "date"
proc sql;
create table H.merged as
select *
from r.small a inner join R.large b
on a.id=b.id
order by id,codes,date;
quit;
Believing your small as unique ID as discussed before .that is each obs has distinct ID
proc sql;
create table H.merged as
select a.*,codes,date
from r.small a inner join R.large b
on a.id=b.id
order by a.id,codes,date;
quit;
This assumed codes and date are not in small
Well, the order of Proc SQL output is never guaranteed and so that pretty much warrants an order by clause to have your output ordered based on a group unless otherwise desired or if order doesn't matter.
I generally apply the order by clause whenever I use Proc SQL in the objective that this output may serve a purpose that would go as input for further processing where we might be using a datastep DOW/By Group processing etc. And In general, SAS works best when have you ordered long data as opposed to unordered wide data. This is thumb rule!
@Cruise But what is the benefit of order by a.id? instead just id?
That is to avoid ambiguous reference as Sql processor doesn't know which ID column to select as it exists in both tables
@Cruise wrote:
@Kurt_Bremser I thought too. but not that bad. SQL outputted N-Rows=108,956,917 and N_uniq_id=181,978. However, I have only selected three variables because I didn't believe this could output a table. But it did. Now I'm running the same SQL again with select *. There are green warning messages below. Is it safe to ignore these warnings?
WARNING: Column named ID is duplicated in a select expression (or a view).
Explicit references to it will be to the first one.WARNING: Variable obligation already exists on file H.MERGED.
That's what happens when you use the asterisk. You need to be specific as to what columns from where should be included in the target dataset.
Thanks everybody: Mission accomplished.
NOTE: Table H.MERGED created, with 108956917 rows and 108 columns.
31 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 36:31.60
cpu time 4:17.32
36 minutes is quite OK for such sizes. It would be great if you posted your final successful code, as a reference for people who find this thread in a search.
for future reference. final success code is below which was written by novinosirin.
proc sql;
create table H.merged as
select a.*,codes,date
from r.small a inner join R.large b
on a.id=b.id
order by a.id,codes,date;
quit;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.