danwarags
Obsidian | Level 7

Hello Everyone,

 

The dataset below shows automobile sales for each individual and how long they used each vehicle. I want to create a table showing the number of individuals who bought 1 vehicle, 2 vehicles, 3 vehicles, and so on. I also need the average age of these individuals, the percentage of males and females, and the average duration of vehicle use. A sample dataset follows.

 

Customer_ID  sales  age  gender  sdate      edate
1            car    12   M       1/1/2001   1/12/2001
1            bike   12   M       1/2/2001   1/18/2001
1            truck  14   M       1/6/2003   1/8/2003
2            car    22   F       3/4/2001   3/8/2001
3            bike   34   M       2/4/2002   2/12/2002
3            bike   34   M       2/10/2002  2/24/2002
3            truck  35   M       2/14/2003  2/18/2003
6            bike   74   F       3/15/2003  3/18/2003
4            car    40   M       3/15/2003  3/18/2003
4            truck  41   M       3/20/2004  3/26/2004
5            bike   32   F       3/23/2001  3/29/2004


My output should look something like the table below. The first column is the number of vehicles sold, and the second column is the number of individuals who bought that many vehicles. For example, three customer IDs (2, 6, 5) had 1 sale each, only one ID (4) had 2 sales, and so on. I then need the average age of the individuals, the percentage of males and females, and the average duration these individuals used the vehicles.

 

No of vehicles  No of Individuals  Avg Age (Mean, SD)  %males  %females  Avg duration (Mean, SD)
1               3
2               1
3               2

 

 

The average age for the first row of the table above should be (22+74+32)/3, but 74 is an outlier. What is the best way to calculate the average age when there is an outlier? Moreover, the third row has two individuals with three sales each. Will the average age be (12+12+14+34+34+35)/6? How should I calculate it? Please guide me.

 

Thank you in advance!

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User
data have;
infile cards expandtabs truncover;
input Customer_ID	sales $	age	gender $ (sdate	edate) (: mmddyy10.);
format sdate edate mmddyy10.;
cards;
1	car	12	M	1/1/2001	1/12/2001
1	bike 	12	M	1/2/2001	1/18/2001
1	truck 	14	M	1/6/2003	1/8/2003
2	car	22	F	3/4/2001	3/8/2001
3	bike 	34	M	2/4/2002	2/12/2002
3	bike 	34	M	2/10/2002	2/24/2002
3	truck 	35	M	2/14/2003	2/18/2003
6	bike 	74	F	3/15/2003	3/18/2003
4	car	40	M	3/15/2003	3/18/2003
4	truck 	41	M	3/20/2004	3/26/2004
5	bike 	32	F	3/23/2001	3/29/2004
;
run;
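proc sql;
/* Step 1: one row per customer */
create table temp as 
 select customer_id,
        count(*) as n,            /* number of vehicles bought */
        avg(age) as age,          /* mean age across the customer's purchases */
        max(gender) as gender,    /* assumes gender is constant within a customer */
        avg(edate-sdate) as dur   /* mean usage duration in days */
  from have
   group by customer_id;

/* Step 2: one row per number of vehicles */
create table want as
 select n,
        count(*) as n_individual,
        avg(age) as avg_age,
        std(age) as std_age,
        avg(dur) as avg_dur,
        std(dur) as std_dur,
        /* a comparison evaluates to 1 or 0, so SUM counts the matches */
        sum(gender='F')/count(*) as per_female format=percent7.2,
        sum(gender='M')/count(*) as per_male format=percent7.2
  from temp
   group by n;
quit;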



For outliers, you could use PROC ROBUSTREG or the IML routines (LTS, LMS, ...) to identify them.
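
On the question of how the three-vehicle row is averaged: the query above first averages age within each customer and then averages those per-customer values within each group. As a rough check on the sample data (the variable names pooled and two_step are only illustrative), both routes give 141/6 = 23.5 here, because every customer in a group contributes the same number of rows:

```sas
/* Sketch: compare the pooled mean with the mean of per-customer means
   for the two customers (IDs 1 and 3) who bought three vehicles. */
data _null_;
  pooled   = mean(12, 12, 14, 34, 34, 35);               /* all six rows */
  two_step = mean(mean(12, 12, 14), mean(34, 34, 35));   /* per customer, then across */
  put pooled= two_step=;
run;
```

The standard deviations are not interchangeable, though: STD(age) in WANT is taken over the per-customer means (two values for this row), not over the six purchase rows.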


6 REPLIES
Reeza
Super User

Most of your requirements can be met with PROC MEANS or PROC FREQ.

For outliers in your average, look at the trimmed mean (the TRIMMED= option of PROC UNIVARIATE; PROC MEANS does not compute one).

http://support.sas.com/training/tutorial/

Look at the Summary Statistics and Descriptive Statistics videos on the right-hand side of the page.
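
A minimal sketch of that suggestion, assuming the per-customer table TEMP from the accepted solution and the original HAVE dataset already exist. The median is included as one outlier-resistant alternative to the mean; whether it or a trimmed mean is appropriate depends on the real data, and with only a handful of customers per group neither will be very stable.

```sas
/* Summary statistics by number of vehicles (n), one row per customer in TEMP */
proc means data=temp n mean std median maxdec=2;
  class n;
  var age dur;
run;

/* Trimmed mean of age over all purchase rows: TRIMMED=1 drops the single
   smallest and single largest value before averaging */
proc univariate data=have trimmed=1;
  var age;
  ods select TrimmedMeans;   /* show only the trimmed-mean table */
run;
```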

danwarags
Obsidian | Level 7

Thank you so much, Ksharp. This was exactly the output I needed. The only problem is that the output dataset shows a high standard deviation for age, which suggests there are outliers that need to be eliminated. I have never used PROC ROBUSTREG before. I am a physician by profession, so I have little statistics background. Could you please guide me in eliminating outliers from the dataset below?

 

ID  sales  age
1   car    12
1   bike   12
1   truck  14
2   car    22
3   bike   34
3   bike   34
3   truck  35
6   bike   74
4   car    40
4   truck  41
5   bike   32
7   car    34
7   bike   35
7   plane  36
7   truck  37
8   bike   72
8   car    73
8   plane  73
8   truck  74

 

After eliminating outliers, my output should look like this:

 

No of vehicles  No of IDs  Avg Age  std Age
1               3
2               1
3               2
4               2

 

I really appreciate your help. Thank you in advance. 

Ksharp
Super User
You didn't provide enough data.
Run the following code, then open the dataset OUTLIERS.



data have;
infile cards expandtabs truncover;
input Customer_ID	sales $	age	gender $ (sdate	edate) (: mmddyy10.);
format sdate edate mmddyy10.;
cards;
1	car	12	M	1/1/2001	1/12/2001
1	bike 	12	M	1/2/2001	1/18/2001
1	truck 	14	M	1/6/2003	1/8/2003
2	car	22	F	3/4/2001	3/8/2001
3	bike 	34	M	2/4/2002	2/12/2002
3	bike 	34	M	2/10/2002	2/24/2002
3	truck 	35	M	2/14/2003	2/18/2003
6	bike 	74	F	3/15/2003	3/18/2003
4	car	40	M	3/15/2003	3/18/2003
4	truck 	41	M	3/20/2004	3/26/2004
5	bike 	32	F	3/23/2001	3/29/2004
;
run;
proc robustreg data=have  method=lts;
model age = ;
output out=outliers outlier=outliers;
run;
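
One possible way to carry the flags into the summary table is sketched below. It assumes the OUTLIER= variable (named outliers in the code above) is 1 for flagged rows and 0 otherwise, and it keeps the vehicle count based on all of a customer's purchases while excluding flagged rows from the age average. The table names temp_clean and want_clean and the column n_ids_used are illustrative, not part of the original post.

```sas
proc sql;
/* One row per customer: count every purchase, average only unflagged ages */
create table temp_clean as
 select customer_id,
        count(*) as n,
        avg(case when outliers = 0 then age else . end) as age
  from outliers
   group by customer_id;

/* One row per number of vehicles; customers whose rows were all flagged
   have a missing age and drop out of AVG, STD and COUNT(age) */
create table want_clean as
 select n,
        count(*) as n_individual,
        count(age) as n_ids_used,
        avg(age) as avg_age,
        std(age) as std_age
  from temp_clean
   group by n;
quit;
```

Comparing n_individual with n_ids_used shows how many customers in each group lost all of their rows to the outlier filter.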



Ksharp
Super User
Or IML code.


data have;
infile cards expandtabs truncover;
input Customer_ID	sales $	age	gender $ (sdate	edate) (: mmddyy10.);
format sdate edate mmddyy10.;
cards;
1	car	12	M	1/1/2001	1/12/2001
1	bike 	12	M	1/2/2001	1/18/2001
1	truck 	14	M	1/6/2003	1/8/2003
2	car	22	F	3/4/2001	3/8/2001
3	bike 	34	M	2/4/2002	2/12/2002
3	bike 	34	M	2/10/2002	2/24/2002
3	truck 	35	M	2/14/2003	2/18/2003
6	bike 	74	F	3/15/2003	3/18/2003
4	car	40	M	3/15/2003	3/18/2003
4	truck 	41	M	3/20/2004	3/26/2004
5	bike 	32	F	3/23/2001	3/29/2004
;
run;
proc iml;
use have;
read all var{age};
close;

optn = j(9,1,.);
call lms(scLMS, coefLMS, wgtLMS, optn, age);
call lts(scLTS, coefLTS, wgtLTS, optn, age);
LMSOutliers = loc(wgtLMS[1,]=0);
LTSOutliers = loc(wgtLTS[1,]=0);
print LMSOutliers, LTSOutliers;

quit;



OUTPUT (observation numbers):

LMSOutliers
1	2	3	8
LTSOutliers
1	2	3	8
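
The numbers printed above are row positions in HAVE: rows 1 to 3 are customer 1 (ages 12, 12, 14) and row 8 is customer 6 (age 74). A small sketch for attaching that flag to the data follows; the list (1, 2, 3, 8) is copied from the printed output rather than computed, and the names flagged, obs and is_outlier are illustrative.

```sas
/* Mark the rows that LMS/LTS reported as outliers */
data flagged;
  set have;
  obs = _n_;                           /* row position in HAVE */
  is_outlier = (obs in (1, 2, 3, 8));  /* 1 = flagged, 0 = kept */
run;

proc print data=flagged;
  var obs customer_id age is_outlier;
run;
```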



Ksharp
Super User

proc robustreg data=have;
model age=/ cutoff=1;
output out=outliers outlier=outlier;
run;
proc print;run;
