BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
fengyuwuzu
Pyrite | Level 9

In table A, I have ID, gender, age, etc demographics info;

In table B, I have ID, order_datetime, order_amount etc order details info.

 

Table A has more IDs than table B (many IDs never put an order), and in table B each ID may have multiple orders.

 

Now I want to merge them, to have a new table, which has all info from Table A, and also order_datetime from Table B.

In this case, shall I use the left join in proc sql?

 

proc sql;
create table demo_order as
select a.ID, a.gender, a.age, b.order_date 
from demographics as a 
     left join 
     order_details as b
     on a.ID eq b.ID;
quit;

 

Then, I will calculate how many orders one ID made per day. Can I use the following code?

 

proc sql;
create table order_per_day as
select ID, order_date,
       count(*) as num_orders
from demo_order
group by ID, order_date;
quit;

 

I also want to find out which IDs never put an order, and count how many. Can I count directly in the proc sql step?

 

proc sql;
create table never_ordered as
select ID, age, gender 
from demo_order
where order_date is missing;
quit;

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Use a subset or generate sample data to test on.

 

A good way to do this is to use OBS=10000 say to restrict your data to smaller sizes and test your results.  

option obs=100000;

*run tests;



option obs=max;

 

 

The reality is these are business decisions to a point, the code may be correct but it may not answer your question.

 

Regarding your questions:

Do you really want a left join or a full join? You want a left join if you want all ID's to be included even if they didn't order. You want a full join if you only want IDs with orders.

 

If you have multiple orders per ID, then yes you will have a table larger than your original, and you can't depend on the counts to verify your merge. This is why it's important to generate some test cases and ensure you're getting what you expect.

 

Your code looks correct as far as I can see.

 

View solution in original post

6 REPLIES 6
Reeza
Super User

Out of curiousity, why can't you test your code?

fengyuwuzu
Pyrite | Level 9

data set is very large. I ran the first step, and there was no error.The new table has many many more rows than the original table A. I guess this is due to some IDs have multiple orders.

I want to make sure I did it correctly before moving forward.

Reeza
Super User

Use a subset or generate sample data to test on.

 

A good way to do this is to use OBS=10000 say to restrict your data to smaller sizes and test your results.  

option obs=100000;

*run tests;



option obs=max;

 

 

The reality is these are business decisions to a point, the code may be correct but it may not answer your question.

 

Regarding your questions:

Do you really want a left join or a full join? You want a left join if you want all ID's to be included even if they didn't order. You want a full join if you only want IDs with orders.

 

If you have multiple orders per ID, then yes you will have a table larger than your original, and you can't depend on the counts to verify your merge. This is why it's important to generate some test cases and ensure you're getting what you expect.

 

Your code looks correct as far as I can see.

 

fengyuwuzu
Pyrite | Level 9
Thank you very much for your prompt replies, and your suggestions. Yes, it is a good idea to use a small portion of the data set to test.

when you said: "You want a full join if you only want IDs with orders." -- it should be inner join, not full join, right?
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Can you post some example test data - in the form of a datastep - and what you want the output to look like.  I can't really provide any code here without something to work with, however I will say the following:

Why do you join a large dataset, to another large dataset, and then from that start summarizing the data?  This is most process intensive.  Why do you not summarize the order dataset first, and then merge the summarized data onto the other dataset?  

Also, have you tried doing this directly in one datastep?  Perfomance may be better by doing it that way, and from the logic I don't see anything too difficult.  Provide some test data and can have a look.

fengyuwuzu
Pyrite | Level 9
Thank you, RW9. You are right. It would be more efficient to count the order before merge.

I also dropped many variables when I merge first, which was to reduce the processing time too.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 5694 views
  • 0 likes
  • 3 in conversation