Solved: Re: proc sql: add one column from one table to another table and count

fengyuwuzu · Posted 02-22-2016 12:09 AM

In table A, I have ID, gender, age, etc demographics info;

In table B, I have ID, order_datetime, order_amount etc order details info.

Table A has more IDs than table B (many IDs never put an order), and in table B each ID may have multiple orders.

Now I want to merge them, to have a new table, which has all info from Table A, and also order_datetime from Table B.

In this case, shall I use the left join in proc sql?

proc sql;
create table demo_order as
select a.ID, a.gender, a.age, b.order_date 
from demographics as a 
     left join 
     order_details as b
     on a.ID eq b.ID;
quit;

Then, I will calculate how many orders one ID made per day. Can I use the following code?

proc sql;
create table order_per_day as
select ID, order_date,
       count(*) as num_orders
from demo_order
group by ID, order_date;
quit;

I also want to find out which IDs never put an order, and count how many. Can I count directly in the proc sql step?

proc sql;
create table never_ordered as
select ID, age, gender 
from demo_order
where order_date is missing;
quit;

Reeza · Posted 02-22-2016 12:27 AM

Use a subset or generate sample data to test on.

A good way to do this is to use OBS=10000 say to restrict your data to smaller sizes and test your results.

option obs=100000;

*run tests;



option obs=max;

The reality is these are business decisions to a point, the code may be correct but it may not answer your question.

Regarding your questions:

Do you really want a left join or a full join? You want a left join if you want all ID's to be included even if they didn't order. You want a full join if you only want IDs with orders.

If you have multiple orders per ID, then yes you will have a table larger than your original, and you can't depend on the counts to verify your merge. This is why it's important to generate some test cases and ensure you're getting what you expect.

Your code looks correct as far as I can see.

View solution in original post

Reeza · Posted 02-22-2016 12:13 AM

Out of curiousity, why can't you test your code?

fengyuwuzu · Posted 02-22-2016 12:18 AM

data set is very large. I ran the first step, and there was no error.The new table has many many more rows than the original table A. I guess this is due to some IDs have multiple orders.

I want to make sure I did it correctly before moving forward.

Reeza · Posted 02-22-2016 12:27 AM

Use a subset or generate sample data to test on.

A good way to do this is to use OBS=10000 say to restrict your data to smaller sizes and test your results.

option obs=100000;

*run tests;



option obs=max;

The reality is these are business decisions to a point, the code may be correct but it may not answer your question.

Regarding your questions:

Do you really want a left join or a full join? You want a left join if you want all ID's to be included even if they didn't order. You want a full join if you only want IDs with orders.

If you have multiple orders per ID, then yes you will have a table larger than your original, and you can't depend on the counts to verify your merge. This is why it's important to generate some test cases and ensure you're getting what you expect.

Your code looks correct as far as I can see.

fengyuwuzu · Posted 02-22-2016 12:41 AM

Thank you very much for your prompt replies, and your suggestions. Yes, it is a good idea to use a small portion of the data set to test.

when you said: "You want a full join if you only want IDs with orders." -- it should be inner join, not full join, right?

RW9 · Posted 02-22-2016 04:39 AM

Can you post some example test data - in the form of a datastep - and what you want the output to look like. I can't really provide any code here without something to work with, however I will say the following:

Why do you join a large dataset, to another large dataset, and then from that start summarizing the data? This is most process intensive. Why do you not summarize the order dataset first, and then merge the summarized data onto the other dataset?

Also, have you tried doing this directly in one datastep? Perfomance may be better by doing it that way, and from the logic I don't see anything too difficult. Provide some test data and can have a look.

fengyuwuzu · Posted 02-22-2016 09:51 AM

Thank you, RW9. You are right. It would be more efficient to count the order before merge.

I also dropped many variables when I merge first, which was to reduce the processing time too.

SAS Training: Just a Click Away