Recently, I am working on prediction for one dataset(for a store). I want to use group_by function to conclude some attributes. But my goal is a little complicated, because I want to separate the data into April and May.
For each specific customer(CUSTOMER_ID):
The my desired outcome is like this, for each specific customer, I can conclude:
customer_id | gender | age | sub_num on April | sub_num on May | Quantity purchased on April | sum of "transaction_date" on April |
1 | ||||||
2 | ||||||
3 |
here is the coding I have
proc sql;
SELECT customer_ID, sum (Profit) as totalProfit, count (QUANTITY) as totalQuantity, count (TRANSACTION_DATE) as frequency,
From cus_tran
where TRANSACTION_DATE < "01MAY2014"d
Group By customer_ID;
quit;
I am confused, because I want to group each specific customers, and conclude several data only on April, one only on May, and combine those dataset in one table. I think where clause is not available here, if only focus on April data, the data is likely to become incomplete, thus I cannot got data for each customers. I spend a so hard time on it, can you please help me fix it? Thanks so much!
Let's translate the table headers into an SQL query.
select
ID
,sum(case when month(transaction_date)=4 then profit else . end)
as total_profit_only_in_April
,sum(case when month(transaction_date)=4 then quantity else . end)
as total_quantity_only_in_April
,count(case when month(transaction_date)=4 then quantity else . end)
as frequency_only_in_April
,count(case when month(transaction_date)=5 then quantity else . end)
as frequency_only_in_May
,max(gender) as gender
,max(age) as age
from cus_tran
group by id
;
Can you please post the data you have in usable form? This is a data step using datalines.
@Jianan_luna wrote:
Sorry, can you please let me know what kind of data type you are looking
for? I didn’t make sense
To be able to suggest something useful, i need to see data you have.
Excel spreadsheets are mostly useless in representing SAS datasets. Excel does not have the concept of fixed columns attributes, so we can not infer types, sizes, and formats from an Excel spreadsheet.
On top of that, many corporate environments block the download of Office files in their proxies and firewalls, because of security aspects. MS Office files are the #1 vector for malware attacks.
A data step with datalines, on the other hand, is just text and therefore no security concern at all. And it takes just a simple copy/paste of the code to our program editor and a submit to create an exact replica of your dataset, while Excel files force us to download and then import them, with quite unreliable and unpredictable results.
Creating such a data step with datalines is no rocket science (even people of limited intelligence, like myself, have mastered it), but a very useful SAS skill that SAS newcomers should learn to use right at the beginning. Creating fake data for testing is a necessary tool for your SAS toolbox.
Thanks so much Sir, I think I got it. I create a code like this, please check it
data cus_tran;
input ID AGE $5. GENDER $ SUB_NUM TRANSACTION_DATE $10. QUANTITY Profit;
Datalines;
1 10-19 F 0 2014-05-06 1 13
1 10-19 F 1 2014-05-06 1 13
1 10-19 F 1 2014-04-10 2 22
2 10-19 M 1 2014-04-11 2 10
3 30-40 F 1 2014-04-07 1 10
4 10-19 F 0 2014-04-08 1 10
3 30-40 F 0 2014-04-22 1 20
5 20-30 M 1 2014-04-30 1 20
2 10-19 M 0 2014-05-01 1 20
3 30-40 F 0 2014-05-06 1 10
4 10-19 F 1 2014-05-10 3 20
4 10-19 F 0 2014-04-30 1 20
5 20-30 M 1 2014-05-10 1 20
2 10-19 M 0 2014-05-10 1 30
3 30-40 F 0 2014-05-01 1 10
2 10-19 M 1 2014-04-01 1 10
3 30-40 F 1 2014-05-01 1 20
;
RUN;
So we are interested in dates of April and May 2014. The WHERE should look like this
where '01apr2014'd le transaction_date le '31may2014'd
Since you want age and gender in the result, you need to have them in the GROUP BY
group by customer_id, age, gender
To build a conditional sum, use a combination of the summary function and CASE
sum(case when transaction_date lt '01may2014'd then sub_num else 0 end) as sub_num_april
From these blocks, you should be able to build your query.
Thanks so much, here is the dataline I created, please check:
data cus_tran;
input ID AGE $5. GENDER $ SUB_NUM TRANSACTION_DATE $10. QUANTITY Profit;
Datalines;
1 10-19 F 0 2014-05-06 1 13
1 10-19 F 1 2014-05-06 1 13
1 10-19 F 1 2014-04-10 2 22
2 10-19 M 1 2014-04-11 2 10
3 30-40 F 1 2014-04-07 1 10
4 10-19 F 0 2014-04-08 1 10
3 30-40 F 0 2014-04-22 1 20
5 20-30 M 1 2014-04-30 1 20
2 10-19 M 0 2014-05-01 1 20
3 30-40 F 0 2014-05-06 1 10
4 10-19 F 1 2014-05-10 3 20
4 10-19 F 0 2014-04-30 1 20
5 20-30 M 1 2014-05-10 1 20
2 10-19 M 0 2014-05-10 1 30
3 30-40 F 0 2014-05-01 1 10
2 10-19 M 1 2014-04-01 1 10
3 30-40 F 1 2014-05-01 1 20
;
RUN;
I am using GROUP_BY to aggregate some variables. My goal is group by ID, then SUM the "Profit" only on April, COUNT the "Quantity" only on April, SUM the "SUB_NUM" only on May. I tried to find how to use IF function in aggregation, but I didn't find it. Could you please help me figure it out? Here is the dataset and my coding.
data cus_tran;
input ID AGE $5. GENDER $ SUB_NUM TRANSACTION_DATE $10. QUANTITY Profit;
Datalines;
1 10-19 F 0 2014-05-06 1 13
1 10-19 F 1 2014-05-06 1 13
1 10-19 F 1 2014-04-10 2 22
2 10-19 M 1 2014-04-11 2 10
3 30-40 F 1 2014-04-07 1 10
4 10-19 F 0 2014-04-08 1 10
3 30-40 F 0 2014-04-22 1 20
5 20-30 M 1 2014-04-30 1 20
2 10-19 M 0 2014-05-01 1 20
3 30-40 F 0 2014-05-06 1 10
4 10-19 F 1 2014-05-10 3 20
4 10-19 F 0 2014-04-30 1 20
5 20-30 M 1 2014-05-10 1 20
2 10-19 M 0 2014-05-10 1 30
3 30-40 F 0 2014-05-01 1 10
2 10-19 M 1 2014-04-01 1 10
3 30-40 F 1 2014-05-01 1 20
;
RUN;
proc sql;
SELECT ID, sum (Profit) as totalProfit, count (QUANTITY) as totalQuantity,
sum (SUB_NUM) as frequency
From cus_tran
Group By ID;
quit;
This is my coding, but it includes all data from both April and May, can you please help me separate it? Thanks so much!
Sincerely,
Thanks
data cus_tran;
input ID AGE $5. GENDER $ SUB_NUM TRANSACTION_DATE :yymmdd10. QUANTITY Profit;
format TRANSACTION_DATE yymmdd10.;
Datalines;
1 10-19 F 0 2014-05-06 1 13
1 10-19 F 1 2014-05-06 1 13
1 10-19 F 1 2014-04-10 2 22
2 10-19 M 1 2014-04-11 2 10
3 30-40 F 1 2014-04-07 1 10
4 10-19 F 0 2014-04-08 1 10
3 30-40 F 0 2014-04-22 1 20
5 20-30 M 1 2014-04-30 1 20
2 10-19 M 0 2014-05-01 1 20
3 30-40 F 0 2014-05-06 1 10
4 10-19 F 1 2014-05-10 3 20
4 10-19 F 0 2014-04-30 1 20
5 20-30 M 1 2014-05-10 1 20
2 10-19 M 0 2014-05-10 1 30
3 30-40 F 0 2014-05-01 1 10
2 10-19 M 1 2014-04-01 1 10
3 30-40 F 1 2014-05-01 1 20
;
RUN;
proc sql;
create table want as
SELECT ID,put(TRANSACTION_DATE,monyy7.) as month_year, sum (Profit) as totalProfit, count (QUANTITY) as totalQuantity,
sum (SUB_NUM) as frequency
From cus_tran
Group By ID,month_year;
quit;
Thanks so much, but when I run the coding, it shows errors in log. The error is like following: the TRANSACTION_DATE should be numeric, but actually it's a character type. Could you please help me fix it? Thanks so much again!
The code from @novinosrin changes the INPUT statement to make transaction_date numeric. In general, if something is a date (or time or date-time) then you really ought to input it as numeric. There is no value in making dates or times character and huge benefits to making it numeric.
Thanks so much Sir! I want to get the corresponding gender and age for every ID. I add gender and age after SELECT clause, but when I added it, then it the ID is not grouped at all. like this:
Is there any way to get the corresponding age and gender for everyone with Group_by function. Thanks so much. Here is my coding:
data cus_tran;
input ID AGE $5. GENDER $ SUB_NUM TRANSACTION_DATE :yymmdd10. QUANTITY Profit;
format TRANSACTION_DATE yymmdd10.;
Datalines;
1 10-19 F 0 2014-05-06 1 13
1 10-19 F 1 2014-05-06 1 13
1 10-19 F 1 2014-04-10 2 22
2 10-19 M 1 2014-04-11 2 10
3 30-40 F 1 2014-04-07 1 10
4 10-19 F 0 2014-04-08 1 10
3 30-40 F 0 2014-04-22 1 20
5 20-30 M 1 2014-04-30 1 20
2 10-19 M 0 2014-05-01 1 20
3 30-40 F 0 2014-05-06 1 10
4 10-19 F 1 2014-05-10 3 20
4 10-19 F 0 2014-04-30 1 20
5 20-30 M 1 2014-05-10 1 20
2 10-19 M 0 2014-05-10 1 30
3 30-40 F 0 2014-05-01 1 10
2 10-19 M 1 2014-04-01 1 10
3 30-40 F 1 2014-05-01 1 20
;
RUN;
proc sql;
create table want as
SELECT ID,put(TRANSACTION_DATE,monyy7.) as month_year, sum (Profit) as totalProfit, count (QUANTITY) as totalQuantity,
sum (SUB_NUM) as frequency, GENDER, AGE
From cus_tran
Group By ID,GENDER,AGE,month_year;
quit;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.