Hello Everyone,
The dataset below shows automobile sales for each individual and how long they used each vehicle. I want to create a table showing the number of individuals who bought 1 vehicle, 2 vehicles, 3 vehicles, and so on. I also need the average age of these individuals, the percentage of males and females, and the average duration of vehicle use. Here is the sample dataset.
Customer_ID | sales | age | gender | sdate | edate |
1 | car | 12 | M | 1/1/2001 | 1/12/2001 |
1 | bike | 12 | M | 1/2/2001 | 1/18/2001 |
1 | truck | 14 | M | 1/6/2003 | 1/8/2003 |
2 | car | 22 | F | 3/4/2001 | 3/8/2001 |
3 | bike | 34 | M | 2/4/2002 | 2/12/2002 |
3 | bike | 34 | M | 2/10/2002 | 2/24/2002 |
3 | truck | 35 | M | 2/14/2003 | 2/18/2003 |
6 | bike | 74 | F | 3/15/2003 | 3/18/2003 |
4 | car | 40 | M | 3/15/2003 | 3/18/2003 |
4 | truck | 41 | M | 3/20/2004 | 3/26/2004 |
5 | bike | 32 | F | 3/23/2001 | 3/29/2004 |
My output should look something like the table below. The first column is the number of vehicles sold; the second column is how many individuals bought that many vehicles. For example, 3 customer IDs had 1 sale (IDs 2, 6, 5), only one ID had 2 sales (ID 4), and so on. I then need to find the average age of the individuals, %males, %females, and the average duration these individuals used the vehicles.
No of vehicles | No of Individuals | Avg Age (Mean, SD) | %males | %females | Avg duration (Mean, SD) |
1 | 3 | | | | |
2 | 1 | | | | |
3 | 2 | | | | |
The average age for the first row of the table above should be (22+74+32)/3. But 74 is an outlier in this group. What is the best way to calculate the average age when there is an outlier? Moreover, the third row has two individuals who each had three sales. Will the average age then be (12+12+14+34+34+35)/6? How should I calculate it? Please guide me.
Thank you in advance!
data have;
infile cards expandtabs truncover;
input Customer_ID sales $ age gender $ (sdate edate) (: mmddyy10.);
format sdate edate mmddyy10.;
cards;
1 car 12 M 1/1/2001 1/12/2001
1 bike 12 M 1/2/2001 1/18/2001
1 truck 14 M 1/6/2003 1/8/2003
2 car 22 F 3/4/2001 3/8/2001
3 bike 34 M 2/4/2002 2/12/2002
3 bike 34 M 2/10/2002 2/24/2002
3 truck 35 M 2/14/2003 2/18/2003
6 bike 74 F 3/15/2003 3/18/2003
4 car 40 M 3/15/2003 3/18/2003
4 truck 41 M 3/20/2004 3/26/2004
5 bike 32 F 3/23/2001 3/29/2004
;
run;

proc sql;
create table temp as
 select customer_id,
        count(*) as n,
        avg(age) as age,
        max(gender) as gender,
        avg(edate-sdate) as dur
  from have
   group by customer_id;

create table want as
 select n,
        count(*) as n_individual,
        avg(age) as avg_age,
        std(age) as std_age,
        avg(dur) as avg_dur,
        std(dur) as std_dur,
        sum(gender='F')/count(*) as per_female format=percent7.2,
        sum(gender='M')/count(*) as per_male format=percent7.2
  from temp
   group by n;
quit;

For outliers, you could use PROC ROBUSTREG or the IML routines (LTS, LMS, ...) to identify them.
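On the question of pooling rows versus averaging per customer: the query above averages within each customer first, then across customers. For the three-sale group the two means agree, since (12+12+14+34+34+35)/6 = 23.5 and the mean of the per-customer averages (38/3 and 103/3) is also 23.5. The standard deviations differ, though, because one is taken over 6 rows and the other over 2 customers. The following is only a sketch to compare the two; the dataset name three_sales and the column aliases are made up for illustration.

```sas
/* The two customers with three sales each (IDs 1 and 3 from the question) */
data three_sales;
  input customer_id age;
  datalines;
1 12
1 12
1 14
3 34
3 34
3 35
;
run;

proc sql;
  /* Row level: every sale contributes one age value */
  select avg(age) as row_mean,
         std(age) as row_sd
    from three_sales;

  /* Customer level: average each customer's ages first,
     then summarize across customers */
  select avg(cust_age) as cust_mean,
         std(cust_age) as cust_sd
    from (select customer_id, avg(age) as cust_age
            from three_sales
           group by customer_id);
quit;
```

Since the table counts individuals, the customer-level version is probably the one that matches "average age of these individuals".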
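Most of your requirements can be handled with PROC MEANS or PROC FREQ.
For outliers affecting your average, look at the trimmed mean. Note that it is requested with the TRIMMED= option in PROC UNIVARIATE; PROC MEANS itself does not compute it.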
http://support.sas.com/training/tutorial/
Look at the Summary Statistics and Descriptive Statistics videos. Right hand side of the page.
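A rough sketch of what that could look like on the sample data, in case it helps. The dataset names (sales, per_customer) and the column n_vehicles are invented for this example, and the trimmed mean is requested through PROC UNIVARIATE's TRIMMED= option, trimming one observation from each tail purely as an illustration.

```sas
/* Sample data from the question (ID, age and gender only) */
data sales;
  input customer_id age gender $;
  datalines;
1 12 M
1 12 M
1 14 M
2 22 F
3 34 M
3 34 M
3 35 M
6 74 F
4 40 M
4 41 M
5 32 F
;
run;

/* One row per customer: number of sales, mean age, gender */
proc sql;
  create table per_customer as
  select customer_id,
         count(*)    as n_vehicles,
         avg(age)    as age,
         max(gender) as gender
    from sales
   group by customer_id;
quit;

/* N, mean and SD of age for each number of vehicles */
proc means data=per_customer n mean std maxdec=2;
  class n_vehicles;
  var age;
run;

/* Row percentages give %F and %M within each number of vehicles */
proc freq data=per_customer;
  tables n_vehicles*gender / nocol nopercent;
run;

/* Trimmed mean of age: drop 1 observation from each tail */
ods select TrimmedMeans;
proc univariate data=per_customer trimmed=1;
  var age;
run;
```

With so few customers per group, a trimmed mean within each group is not very meaningful, so the last step runs over all customers together.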
Thank you so much, Ksharp. This was exactly the output I needed. The only problem is that the output dataset shows a high standard deviation for age, which suggests there are outliers that need to be eliminated. I have never used PROC ROBUSTREG before. I am a physician by profession, so I have little statistics background. Could you please guide me in eliminating outliers from the dataset below?
ID | sales | age |
1 | car | 12 |
1 | bike | 12 |
1 | truck | 14 |
2 | car | 22 |
3 | bike | 34 |
3 | bike | 34 |
3 | truck | 35 |
6 | bike | 74 |
4 | car | 40 |
4 | truck | 41 |
5 | bike | 32 |
7 | car | 34 |
7 | bike | 35 |
7 | plane | 36 |
7 | truck | 37 |
8 | bike | 72 |
8 | car | 73 |
8 | plane | 73 |
8 | truck | 74 |
After eliminating outliers my output should look as below:
No of vehicles | No of IDs | Avg Age | std Age |
1 | 3 | ||
2 | 1 | ||
3 | 2 | ||
4 | 2 |
I really appreciate your help. Thank you in advance.
You didn't provide enough data. Run the following code, then open the dataset OUTLIERS.

data have;
infile cards expandtabs truncover;
input Customer_ID sales $ age gender $ (sdate edate) (: mmddyy10.);
format sdate edate mmddyy10.;
cards;
1 car 12 M 1/1/2001 1/12/2001
1 bike 12 M 1/2/2001 1/18/2001
1 truck 14 M 1/6/2003 1/8/2003
2 car 22 F 3/4/2001 3/8/2001
3 bike 34 M 2/4/2002 2/12/2002
3 bike 34 M 2/10/2002 2/24/2002
3 truck 35 M 2/14/2003 2/18/2003
6 bike 74 F 3/15/2003 3/18/2003
4 car 40 M 3/15/2003 3/18/2003
4 truck 41 M 3/20/2004 3/26/2004
5 bike 32 F 3/23/2001 3/29/2004
;
run;

proc robustreg data=have method=lts;
model age = ;
output out=outliers outlier=outliers;
run;
Or IML code.

data have;
infile cards expandtabs truncover;
input Customer_ID sales $ age gender $ (sdate edate) (: mmddyy10.);
format sdate edate mmddyy10.;
cards;
1 car 12 M 1/1/2001 1/12/2001
1 bike 12 M 1/2/2001 1/18/2001
1 truck 14 M 1/6/2003 1/8/2003
2 car 22 F 3/4/2001 3/8/2001
3 bike 34 M 2/4/2002 2/12/2002
3 bike 34 M 2/10/2002 2/24/2002
3 truck 35 M 2/14/2003 2/18/2003
6 bike 74 F 3/15/2003 3/18/2003
4 car 40 M 3/15/2003 3/18/2003
4 truck 41 M 3/20/2004 3/26/2004
5 bike 32 F 3/23/2001 3/29/2004
;
run;

proc iml;
use have;
read all var{age};
close;
optn = j(9,1,.);
call lms(scLMS, coefLMS, wgtLMS, optn, age);
call lts(scLTS, coefLTS, wgtLTS, optn, age);
LMSOutliers = loc(wgtLMS[1,]=0);
LTSOutliers = loc(wgtLTS[1,]=0);
print LMSOutliers, LTSOutliers;
quit;

OUTPUT (observation numbers):
LMSOutliers 1 2 3 8
LTSOutliers 1 2 3 8
proc robustreg data=have;
model age = / cutoff=1;
output out=outliers outlier=outlier;
run;

proc print;
run;
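Once observations are flagged, one possible way to get the summary table "after eliminating outliers" is to ignore the flagged ages when averaging. This is only a sketch under several assumptions. It uses the 19-row ID/age data posted above. The names ages, flagged, is_outlier, per_customer and want_clean are invented for the example. It counts vehicles on all rows, so dropping an age does not change a customer's vehicle count. Which rows get flagged depends on the method and cutoff you choose.

```sas
/* ID and age from the second sample dataset in this thread */
data ages;
  input customer_id age;
  datalines;
1 12
1 12
1 14
2 22
3 34
3 34
3 35
6 74
4 40
4 41
5 32
7 34
7 35
7 36
7 37
8 72
8 73
8 73
8 74
;
run;

/* Flag outlying ages: is_outlier = 1 for flagged rows, 0 otherwise */
proc robustreg data=ages method=lts;
  model age = ;
  output out=flagged outlier=is_outlier;
run;

proc sql;
  /* Per customer: count all sales, but average only unflagged ages */
  create table per_customer as
  select customer_id,
         count(*) as n,
         avg(case when is_outlier = 0 then age else . end) as age
    from flagged
   group by customer_id;

  /* Per number of vehicles: customers with no usable age are
     excluded from n_with_age, avg_age and std_age */
  create table want_clean as
  select n,
         count(*)   as n_individual,
         count(age) as n_with_age,
         avg(age)   as avg_age,
         std(age)   as std_age
    from per_customer
   group by n;
quit;
```

Whether an unusual age should be removed at all is a judgment call; a large standard deviation in a small group does not by itself prove the values are wrong.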