BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Jianan_luna
Obsidian | Level 7

Recently, I am working on prediction for one dataset(for a store). I want to use group_by function to conclude some attributes. But my goal is a little complicated, because I want to separate the data into April and May.

For each specific customer(CUSTOMER_ID):

  • Profit, AGE, GENDER for each customer in April
  • QUANTITY purchased by each customer in April
  • I count TRANSACTION_DATE to find the how many times each customers come to store in April
  • SUB_NUM means whether the customers purchase certain product, I use 1 represent the customer purchased it, 0 represents they don't. Here, my goal is: for each customer, did they purchase it on April, and did they purchase it on May, so basically, it's two columns. Here is the most confused point.

The my desired outcome is like this, for each specific customer, I can conclude:

customer_id gender age sub_num on April sub_num on May Quantity purchased on April sum of "transaction_date" on April
1            
2            
3            

here is the coding I have

proc sql;
	SELECT  customer_ID, sum (Profit) as totalProfit, count (QUANTITY) as totalQuantity,  count (TRANSACTION_DATE) as frequency,  
	From cus_tran
	where TRANSACTION_DATE < "01MAY2014"d
	Group By customer_ID;
quit;

I am confused, because I want to group each specific customers, and conclude several data only on April, one only on May, and combine those dataset in one table. I think where clause is not available here, if only focus on April data, the data is likely to become incomplete, thus I cannot got data for each customers. I spend a so hard time on it, can you please help me fix it? Thanks so much!

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Let's translate the table headers into an SQL query.

select 
   ID
  ,sum(case when month(transaction_date)=4 then profit else . end)
     as total_profit_only_in_April
  ,sum(case when month(transaction_date)=4 then quantity else . end)
     as total_quantity_only_in_April
  ,count(case when month(transaction_date)=4 then quantity else . end)
     as frequency_only_in_April
  ,count(case when month(transaction_date)=5 then quantity else . end)
     as frequency_only_in_May
  ,max(gender) as gender
  ,max(age) as age
from cus_tran
group by id 
;

View solution in original post

22 REPLIES 22
andreas_lds
Jade | Level 19

Can you please post the data you have in usable form? This is a data step using datalines.

Jianan_luna
Obsidian | Level 7
Sorry, can you please let me know what kind of data type you are looking
for? I didn’t make sense
andreas_lds
Jade | Level 19

@Jianan_luna wrote:
Sorry, can you please let me know what kind of data type you are looking
for? I didn’t make sense

To be able to suggest something useful, i need to see data you have.

Jianan_luna
Obsidian | Level 7
I post the attachment “cus_tran_sas.xlsx” under my question. Please check it
Kurt_Bremser
Super User

Excel spreadsheets are mostly useless in representing SAS datasets. Excel does not have the concept of fixed columns attributes, so we can not infer types, sizes, and formats from an Excel spreadsheet.

On top of that, many corporate environments block the download of Office files in their proxies and firewalls, because of security aspects. MS Office files are the #1 vector for malware attacks.

A data step with datalines, on the other hand, is just text and therefore no security concern at all. And it takes just a simple copy/paste of the code to our program editor and a submit to create an exact replica of your dataset, while Excel files force us to download and then import them, with quite unreliable and unpredictable results.

Creating such a data step with datalines is no rocket science (even people of limited intelligence, like myself, have mastered it), but a very useful SAS skill that SAS newcomers should learn to use right at the beginning. Creating fake data for testing is a necessary tool for your SAS toolbox.

Jianan_luna
Obsidian | Level 7

Thanks so much Sir, I think I got it. I create a code like this, please check it


data cus_tran;
	input  ID AGE $5. GENDER $ SUB_NUM TRANSACTION_DATE $10. QUANTITY Profit;
	Datalines;
	1 10-19 F 0 2014-05-06 1 13
	1 10-19	F 1 2014-05-06 1 13
	1 10-19 F 1 2014-04-10 2 22
	2 10-19 M 1 2014-04-11 2 10
	3 30-40 F 1 2014-04-07 1 10
	4 10-19 F 0 2014-04-08 1 10
	3 30-40 F 0 2014-04-22 1 20
	5 20-30 M 1 2014-04-30 1 20
	2 10-19 M 0 2014-05-01 1 20
	3 30-40 F 0 2014-05-06 1 10
	4 10-19 F 1 2014-05-10 3 20
	4 10-19 F 0 2014-04-30 1 20
	5 20-30 M 1 2014-05-10 1 20
	2 10-19 M 0 2014-05-10 1 30
	3 30-40 F 0 2014-05-01 1 10
	2 10-19 M 1 2014-04-01 1 10
	3 30-40 F 1 2014-05-01 1 20
	;
RUN;
Kurt_Bremser
Super User

So we are interested in dates of April and May 2014. The WHERE should look like this

where '01apr2014'd le transaction_date le '31may2014'd

Since you want age and gender in the result, you need to have them in the GROUP BY

group by customer_id, age, gender

To build a conditional sum, use a combination of the summary function and CASE

sum(case when transaction_date lt '01may2014'd then sub_num else 0 end) as sub_num_april

From these blocks, you should be able to build your query.

Jianan_luna
Obsidian | Level 7
Thanks so much!
Jianan_luna
Obsidian | Level 7

Thanks so much, here is the dataline I created, please check:


data cus_tran;
	input  ID AGE $5. GENDER $ SUB_NUM TRANSACTION_DATE $10. QUANTITY Profit;
	Datalines;
	1 10-19 F 0 2014-05-06 1 13
	1 10-19	F 1 2014-05-06 1 13
	1 10-19 F 1 2014-04-10 2 22
	2 10-19 M 1 2014-04-11 2 10
	3 30-40 F 1 2014-04-07 1 10
	4 10-19 F 0 2014-04-08 1 10
	3 30-40 F 0 2014-04-22 1 20
	5 20-30 M 1 2014-04-30 1 20
	2 10-19 M 0 2014-05-01 1 20
	3 30-40 F 0 2014-05-06 1 10
	4 10-19 F 1 2014-05-10 3 20
	4 10-19 F 0 2014-04-30 1 20
	5 20-30 M 1 2014-05-10 1 20
	2 10-19 M 0 2014-05-10 1 30
	3 30-40 F 0 2014-05-01 1 10
	2 10-19 M 1 2014-04-01 1 10
	3 30-40 F 1 2014-05-01 1 20
	;
RUN;
Jianan_luna
Obsidian | Level 7

I am using GROUP_BY to aggregate some variables. My goal is group by ID, then SUM the "Profit" only on April, COUNT the "Quantity" only on April, SUM the "SUB_NUM" only on May. I tried to find how to use IF function in aggregation, but I didn't find it. Could you please help me figure it out? Here is the dataset and my coding.

data cus_tran;
	input  ID AGE $5. GENDER $ SUB_NUM TRANSACTION_DATE $10. QUANTITY Profit;
	Datalines;
	1 10-19 F 0 2014-05-06 1 13
	1 10-19	F 1 2014-05-06 1 13
	1 10-19 F 1 2014-04-10 2 22
	2 10-19 M 1 2014-04-11 2 10
	3 30-40 F 1 2014-04-07 1 10
	4 10-19 F 0 2014-04-08 1 10
	3 30-40 F 0 2014-04-22 1 20
	5 20-30 M 1 2014-04-30 1 20
	2 10-19 M 0 2014-05-01 1 20
	3 30-40 F 0 2014-05-06 1 10
	4 10-19 F 1 2014-05-10 3 20
	4 10-19 F 0 2014-04-30 1 20
	5 20-30 M 1 2014-05-10 1 20
	2 10-19 M 0 2014-05-10 1 30
	3 30-40 F 0 2014-05-01 1 10
	2 10-19 M 1 2014-04-01 1 10
	3 30-40 F 1 2014-05-01 1 20
	;
RUN;
proc sql;
	SELECT  ID, sum (Profit) as totalProfit, count (QUANTITY) as totalQuantity, 
			sum (SUB_NUM) as frequency 
	From cus_tran
	Group By ID;
quit;

This is my coding, but it includes all data from both April and May, can you please help me separate it? Thanks so much!

 

Sincerely,

Thanks

novinosrin
Tourmaline | Level 20



data cus_tran;
	input  ID AGE $5. GENDER $ SUB_NUM TRANSACTION_DATE :yymmdd10. QUANTITY Profit;
	format TRANSACTION_DATE yymmdd10.;
	Datalines;
	1 10-19 F 0 2014-05-06 1 13
	1 10-19	F 1 2014-05-06 1 13
	1 10-19 F 1 2014-04-10 2 22
	2 10-19 M 1 2014-04-11 2 10
	3 30-40 F 1 2014-04-07 1 10
	4 10-19 F 0 2014-04-08 1 10
	3 30-40 F 0 2014-04-22 1 20
	5 20-30 M 1 2014-04-30 1 20
	2 10-19 M 0 2014-05-01 1 20
	3 30-40 F 0 2014-05-06 1 10
	4 10-19 F 1 2014-05-10 3 20
	4 10-19 F 0 2014-04-30 1 20
	5 20-30 M 1 2014-05-10 1 20
	2 10-19 M 0 2014-05-10 1 30
	3 30-40 F 0 2014-05-01 1 10
	2 10-19 M 1 2014-04-01 1 10
	3 30-40 F 1 2014-05-01 1 20
	;
RUN;

proc sql;
   create table want as
	SELECT  ID,put(TRANSACTION_DATE,monyy7.) as month_year, sum (Profit) as totalProfit, count (QUANTITY) as totalQuantity, 
			sum (SUB_NUM) as frequency 
	From cus_tran
	Group By ID,month_year;
quit;
Jianan_luna
Obsidian | Level 7

Thanks so much, but when I run the coding, it shows errors in log. The error is like following: the TRANSACTION_DATE should be numeric, but actually it's a character type. Could you please help me fix it? Thanks so much again!

Jianan_luna_0-1606525576218.png

 

PaigeMiller
Diamond | Level 26

The code from @novinosrin changes the INPUT statement to make transaction_date numeric. In general, if something is a date (or time or date-time) then you really ought to input it as numeric. There is no value in making dates or times character and huge benefits to making it numeric.

--
Paige Miller
Jianan_luna
Obsidian | Level 7

Thanks so much Sir! I want to get the corresponding gender and age for every ID. I add gender and age after SELECT clause, but when I added it, then it the ID is not grouped at all. like this:

Is there any way to get the corresponding age and gender for everyone with Group_by function. Thanks so much. Here is my coding:

data cus_tran;
	input  ID AGE $5. GENDER $ SUB_NUM TRANSACTION_DATE :yymmdd10. QUANTITY Profit;
	format TRANSACTION_DATE yymmdd10.;
	Datalines;
	1 10-19 F 0 2014-05-06 1 13
	1 10-19	F 1 2014-05-06 1 13
	1 10-19 F 1 2014-04-10 2 22
	2 10-19 M 1 2014-04-11 2 10
	3 30-40 F 1 2014-04-07 1 10
	4 10-19 F 0 2014-04-08 1 10
	3 30-40 F 0 2014-04-22 1 20
	5 20-30 M 1 2014-04-30 1 20
	2 10-19 M 0 2014-05-01 1 20
	3 30-40 F 0 2014-05-06 1 10
	4 10-19 F 1 2014-05-10 3 20
	4 10-19 F 0 2014-04-30 1 20
	5 20-30 M 1 2014-05-10 1 20
	2 10-19 M 0 2014-05-10 1 30
	3 30-40 F 0 2014-05-01 1 10
	2 10-19 M 1 2014-04-01 1 10
	3 30-40 F 1 2014-05-01 1 20
	;
RUN;

proc sql;
   create table want as
	SELECT  ID,put(TRANSACTION_DATE,monyy7.) as month_year, sum (Profit) as totalProfit, count (QUANTITY) as totalQuantity, 
			sum (SUB_NUM) as frequency, GENDER, AGE 
	From cus_tran
	Group By ID,GENDER,AGE,month_year;
quit;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 22 replies
  • 1336 views
  • 0 likes
  • 6 in conversation