Solved: About Group By

About Group By

Re: About Transpose

Re: About Group By

Re: About Group By

Re: About Group By

Re: About Group By

Re: About Group By

Re: About Group By

Re: About Group By

Re: About Group By

Re: About Group By

About Group_by aggregation functions

Re: About Group_by aggregation functions

Re: About Group_by aggregation functions

Re: About Group_by aggregation functions

Re: About Group_by aggregation functions

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!

Jianan_luna · Posted 11-27-2020 02:06 AM

Recently, I am working on prediction for one dataset(for a store). I want to use group_by function to conclude some attributes. But my goal is a little complicated, because I want to separate the data into April and May.

For each specific customer(CUSTOMER_ID):

Profit, AGE, GENDER for each customer in April
QUANTITY purchased by each customer in April
I count TRANSACTION_DATE to find the how many times each customers come to store in April
SUB_NUM means whether the customers purchase certain product, I use 1 represent the customer purchased it, 0 represents they don't. Here, my goal is: for each customer, did they purchase it on April, and did they purchase it on May, so basically, it's two columns. Here is the most confused point.

The my desired outcome is like this, for each specific customer, I can conclude:

here is the coding I have

proc sql;
	SELECT  customer_ID, sum (Profit) as totalProfit, count (QUANTITY) as totalQuantity,  count (TRANSACTION_DATE) as frequency,  
	From cus_tran
	where TRANSACTION_DATE < "01MAY2014"d
	Group By customer_ID;
quit;

I am confused, because I want to group each specific customers, and conclude several data only on April, one only on May, and combine those dataset in one table. I think where clause is not available here, if only focus on April data, the data is likely to become incomplete, thus I cannot got data for each customers. I spend a so hard time on it, can you please help me fix it? Thanks so much!

Tom · Posted 11-28-2020 10:56 AM

Let's translate the table headers into an SQL query.

select 
   ID
  ,sum(case when month(transaction_date)=4 then profit else . end)
     as total_profit_only_in_April
  ,sum(case when month(transaction_date)=4 then quantity else . end)
     as total_quantity_only_in_April
  ,count(case when month(transaction_date)=4 then quantity else . end)
     as frequency_only_in_April
  ,count(case when month(transaction_date)=5 then quantity else . end)
     as frequency_only_in_May
  ,max(gender) as gender
  ,max(age) as age
from cus_tran
group by id 
;

View solution in original post

andreas_lds · Posted 11-27-2020 02:29 AM

Can you please post the data you have in usable form? This is a data step using datalines.

Jianan_luna · Posted 11-27-2020 02:54 AM

Sorry, can you please let me know what kind of data type you are looking
for? I didn’t make sense

andreas_lds · Posted 11-27-2020 03:12 AM

@Jianan_luna wrote:
Sorry, can you please let me know what kind of data type you are looking
for? I didn’t make sense

To be able to suggest something useful, i need to see data you have.

Jianan_luna · Posted 11-27-2020 03:19 AM

I post the attachment “cus_tran_sas.xlsx” under my question. Please check it

Kurt_Bremser · Posted 11-27-2020 03:45 AM

Excel spreadsheets are mostly useless in representing SAS datasets. Excel does not have the concept of fixed columns attributes, so we can not infer types, sizes, and formats from an Excel spreadsheet.

On top of that, many corporate environments block the download of Office files in their proxies and firewalls, because of security aspects. MS Office files are the #1 vector for malware attacks.

A data step with datalines, on the other hand, is just text and therefore no security concern at all. And it takes just a simple copy/paste of the code to our program editor and a submit to create an exact replica of your dataset, while Excel files force us to download and then import them, with quite unreliable and unpredictable results.

Creating such a data step with datalines is no rocket science (even people of limited intelligence, like myself, have mastered it), but a very useful SAS skill that SAS newcomers should learn to use right at the beginning. Creating fake data for testing is a necessary tool for your SAS toolbox.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Jianan_luna · Posted 11-27-2020 01:15 PM

Thanks so much Sir, I think I got it. I create a code like this, please check it


data cus_tran;
	input  ID AGE $5. GENDER $ SUB_NUM TRANSACTION_DATE $10. QUANTITY Profit;
	Datalines;
	1 10-19 F 0 2014-05-06 1 13
	1 10-19	F 1 2014-05-06 1 13
	1 10-19 F 1 2014-04-10 2 22
	2 10-19 M 1 2014-04-11 2 10
	3 30-40 F 1 2014-04-07 1 10
	4 10-19 F 0 2014-04-08 1 10
	3 30-40 F 0 2014-04-22 1 20
	5 20-30 M 1 2014-04-30 1 20
	2 10-19 M 0 2014-05-01 1 20
	3 30-40 F 0 2014-05-06 1 10
	4 10-19 F 1 2014-05-10 3 20
	4 10-19 F 0 2014-04-30 1 20
	5 20-30 M 1 2014-05-10 1 20
	2 10-19 M 0 2014-05-10 1 30
	3 30-40 F 0 2014-05-01 1 10
	2 10-19 M 1 2014-04-01 1 10
	3 30-40 F 1 2014-05-01 1 20
	;
RUN;

Kurt_Bremser · Posted 11-28-2020 03:44 AM

So we are interested in dates of April and May 2014. The WHERE should look like this

where '01apr2014'd le transaction_date le '31may2014'd

Since you want age and gender in the result, you need to have them in the GROUP BY

group by customer_id, age, gender

To build a conditional sum, use a combination of the summary function and CASE

sum(case when transaction_date lt '01may2014'd then sub_num else 0 end) as sub_num_april

From these blocks, you should be able to build your query.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Jianan_luna · Posted 11-28-2020 12:33 PM

Thanks so much!

Jianan_luna · Posted 11-27-2020 01:16 PM

Thanks so much, here is the dataline I created, please check:


data cus_tran;
	input  ID AGE $5. GENDER $ SUB_NUM TRANSACTION_DATE $10. QUANTITY Profit;
	Datalines;
	1 10-19 F 0 2014-05-06 1 13
	1 10-19	F 1 2014-05-06 1 13
	1 10-19 F 1 2014-04-10 2 22
	2 10-19 M 1 2014-04-11 2 10
	3 30-40 F 1 2014-04-07 1 10
	4 10-19 F 0 2014-04-08 1 10
	3 30-40 F 0 2014-04-22 1 20
	5 20-30 M 1 2014-04-30 1 20
	2 10-19 M 0 2014-05-01 1 20
	3 30-40 F 0 2014-05-06 1 10
	4 10-19 F 1 2014-05-10 3 20
	4 10-19 F 0 2014-04-30 1 20
	5 20-30 M 1 2014-05-10 1 20
	2 10-19 M 0 2014-05-10 1 30
	3 30-40 F 0 2014-05-01 1 10
	2 10-19 M 1 2014-04-01 1 10
	3 30-40 F 1 2014-05-01 1 20
	;
RUN;

Jianan_luna · Posted 11-27-2020 07:43 PM

I am using GROUP_BY to aggregate some variables. My goal is group by ID, then SUM the "Profit" only on April, COUNT the "Quantity" only on April, SUM the "SUB_NUM" only on May. I tried to find how to use IF function in aggregation, but I didn't find it. Could you please help me figure it out? Here is the dataset and my coding.

data cus_tran;
	input  ID AGE $5. GENDER $ SUB_NUM TRANSACTION_DATE $10. QUANTITY Profit;
	Datalines;
	1 10-19 F 0 2014-05-06 1 13
	1 10-19	F 1 2014-05-06 1 13
	1 10-19 F 1 2014-04-10 2 22
	2 10-19 M 1 2014-04-11 2 10
	3 30-40 F 1 2014-04-07 1 10
	4 10-19 F 0 2014-04-08 1 10
	3 30-40 F 0 2014-04-22 1 20
	5 20-30 M 1 2014-04-30 1 20
	2 10-19 M 0 2014-05-01 1 20
	3 30-40 F 0 2014-05-06 1 10
	4 10-19 F 1 2014-05-10 3 20
	4 10-19 F 0 2014-04-30 1 20
	5 20-30 M 1 2014-05-10 1 20
	2 10-19 M 0 2014-05-10 1 30
	3 30-40 F 0 2014-05-01 1 10
	2 10-19 M 1 2014-04-01 1 10
	3 30-40 F 1 2014-05-01 1 20
	;
RUN;

proc sql;
	SELECT  ID, sum (Profit) as totalProfit, count (QUANTITY) as totalQuantity, 
			sum (SUB_NUM) as frequency 
	From cus_tran
	Group By ID;
quit;

This is my coding, but it includes all data from both April and May, can you please help me separate it? Thanks so much!

Sincerely,

Thanks

novinosrin · Posted 11-27-2020 07:50 PM




data cus_tran;
	input  ID AGE $5. GENDER $ SUB_NUM TRANSACTION_DATE :yymmdd10. QUANTITY Profit;
	format TRANSACTION_DATE yymmdd10.;
	Datalines;
	1 10-19 F 0 2014-05-06 1 13
	1 10-19	F 1 2014-05-06 1 13
	1 10-19 F 1 2014-04-10 2 22
	2 10-19 M 1 2014-04-11 2 10
	3 30-40 F 1 2014-04-07 1 10
	4 10-19 F 0 2014-04-08 1 10
	3 30-40 F 0 2014-04-22 1 20
	5 20-30 M 1 2014-04-30 1 20
	2 10-19 M 0 2014-05-01 1 20
	3 30-40 F 0 2014-05-06 1 10
	4 10-19 F 1 2014-05-10 3 20
	4 10-19 F 0 2014-04-30 1 20
	5 20-30 M 1 2014-05-10 1 20
	2 10-19 M 0 2014-05-10 1 30
	3 30-40 F 0 2014-05-01 1 10
	2 10-19 M 1 2014-04-01 1 10
	3 30-40 F 1 2014-05-01 1 20
	;
RUN;

proc sql;
   create table want as
	SELECT  ID,put(TRANSACTION_DATE,monyy7.) as month_year, sum (Profit) as totalProfit, count (QUANTITY) as totalQuantity, 
			sum (SUB_NUM) as frequency 
	From cus_tran
	Group By ID,month_year;
quit;

Jianan_luna · Posted 11-27-2020 08:07 PM

Thanks so much, but when I run the coding, it shows errors in log. The error is like following: the TRANSACTION_DATE should be numeric, but actually it's a character type. Could you please help me fix it? Thanks so much again!

PaigeMiller · Posted 11-27-2020 08:10 PM

The code from @novinosrin changes the INPUT statement to make transaction_date numeric. In general, if something is a date (or time or date-time) then you really ought to input it as numeric. There is no value in making dates or times character and huge benefits to making it numeric.

--
Paige Miller

Jianan_luna · Posted 11-27-2020 08:41 PM

Thanks so much Sir! I want to get the corresponding gender and age for every ID. I add gender and age after SELECT clause, but when I added it, then it the ID is not grouped at all. like this:

Is there any way to get the corresponding age and gender for everyone with Group_by function. Thanks so much. Here is my coding:

data cus_tran;
	input  ID AGE $5. GENDER $ SUB_NUM TRANSACTION_DATE :yymmdd10. QUANTITY Profit;
	format TRANSACTION_DATE yymmdd10.;
	Datalines;
	1 10-19 F 0 2014-05-06 1 13
	1 10-19	F 1 2014-05-06 1 13
	1 10-19 F 1 2014-04-10 2 22
	2 10-19 M 1 2014-04-11 2 10
	3 30-40 F 1 2014-04-07 1 10
	4 10-19 F 0 2014-04-08 1 10
	3 30-40 F 0 2014-04-22 1 20
	5 20-30 M 1 2014-04-30 1 20
	2 10-19 M 0 2014-05-01 1 20
	3 30-40 F 0 2014-05-06 1 10
	4 10-19 F 1 2014-05-10 3 20
	4 10-19 F 0 2014-04-30 1 20
	5 20-30 M 1 2014-05-10 1 20
	2 10-19 M 0 2014-05-10 1 30
	3 30-40 F 0 2014-05-01 1 10
	2 10-19 M 1 2014-04-01 1 10
	3 30-40 F 1 2014-05-01 1 20
	;
RUN;

proc sql;
   create table want as
	SELECT  ID,put(TRANSACTION_DATE,monyy7.) as month_year, sum (Profit) as totalProfit, count (QUANTITY) as totalQuantity, 
			sum (SUB_NUM) as frequency, GENDER, AGE 
	From cus_tran
	Group By ID,GENDER,AGE,month_year;
quit;

customer_id

sub_num on April

sub_num on May

Quantity purchased on April

sum of "transaction_date" on April