In table A, I have ID, gender, age, etc demographics info;
In table B, I have ID, order_datetime, order_amount etc order details info.
Table A has more IDs than table B (many IDs never put an order), and in table B each ID may have multiple orders.
Now I want to merge them, to have a new table, which has all info from Table A, and also order_datetime from Table B.
In this case, shall I use the left join in proc sql?
proc sql;
create table demo_order as
select a.ID, a.gender, a.age, b.order_date
from demographics as a
left join
order_details as b
on a.ID eq b.ID;
quit;
Then, I will calculate how many orders one ID made per day. Can I use the following code?
proc sql;
create table order_per_day as
select ID, order_date,
count(*) as num_orders
from demo_order
group by ID, order_date;
quit;
I also want to find out which IDs never put an order, and count how many. Can I count directly in the proc sql step?
proc sql;
create table never_ordered as
select ID, age, gender
from demo_order
where order_date is missing;
quit;
Use a subset or generate sample data to test on.
A good way to do this is to use OBS=10000 say to restrict your data to smaller sizes and test your results.
option obs=100000;
*run tests;
option obs=max;
The reality is these are business decisions to a point, the code may be correct but it may not answer your question.
Regarding your questions:
Do you really want a left join or a full join? You want a left join if you want all ID's to be included even if they didn't order. You want a full join if you only want IDs with orders.
If you have multiple orders per ID, then yes you will have a table larger than your original, and you can't depend on the counts to verify your merge. This is why it's important to generate some test cases and ensure you're getting what you expect.
Your code looks correct as far as I can see.
Out of curiousity, why can't you test your code?
data set is very large. I ran the first step, and there was no error.The new table has many many more rows than the original table A. I guess this is due to some IDs have multiple orders.
I want to make sure I did it correctly before moving forward.
Use a subset or generate sample data to test on.
A good way to do this is to use OBS=10000 say to restrict your data to smaller sizes and test your results.
option obs=100000;
*run tests;
option obs=max;
The reality is these are business decisions to a point, the code may be correct but it may not answer your question.
Regarding your questions:
Do you really want a left join or a full join? You want a left join if you want all ID's to be included even if they didn't order. You want a full join if you only want IDs with orders.
If you have multiple orders per ID, then yes you will have a table larger than your original, and you can't depend on the counts to verify your merge. This is why it's important to generate some test cases and ensure you're getting what you expect.
Your code looks correct as far as I can see.
Can you post some example test data - in the form of a datastep - and what you want the output to look like. I can't really provide any code here without something to work with, however I will say the following:
Why do you join a large dataset, to another large dataset, and then from that start summarizing the data? This is most process intensive. Why do you not summarize the order dataset first, and then merge the summarized data onto the other dataset?
Also, have you tried doing this directly in one datastep? Perfomance may be better by doing it that way, and from the logic I don't see anything too difficult. Provide some test data and can have a look.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Lock in the best rate now before the price increases on April 1.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.