Solved: How to Subset with Only First Top 90% by Each Group?

Junyong · Posted 09-10-2019 01:41 AM

I'm a bit unsure if the title describes the problem correctly, but I have the following data set.

data have;
	input year firm size indep depen;
cards;
1	1	4907.98	514.14	515.08
1	2	5045.94	509.85	509.03
1	3	4819.58	505.49	506.34
1	4	4967.02	482.39	482.02
1	5	5094.18	506.08	505.62
1	6	5040.42	501.66	499.79
1	7	4987.62	491.48	489.79
2	1	5086.52	512.95	510.39
2	2	4895.88	503.15	503.28
2	3	5002.13	494.76	495.36
3	1	5078.44	506.06	508.51
3	2	4980.23	520.75	521.21
3	3	4948.08	484.90	485.79
3	4	5050.82	497.25	497.74
3	5	4964.70	498.54	497.97
4	1	4952.31	517.95	517.92
4	2	5027.91	501.90	502.15
4	3	4979.13	512.67	512.07
4	4	5005.56	499.99	501.16
4	5	4905.29	517.62	517.56
4	6	4848.28	479.76	481.32
5	1	1969.43	501.68	502.21
5	2	1931.39	497.89	497.67
5	3	9935.65	505.15	504.44
5	4	9995.49	510.52	511.26
5	5	9008.28	525.02	524.49
6	1	9981.95	518.30	517.78
6	2	15.49	506.13	506.81
7	1	5095.48	517.15	516.85
;
run;

Where YEAR, FIRM, SIZE, INDEP, and DEPEN are the time index, the individual index, the market capitalization, the independent variable, and the dependent variable, correspondingly. Instead of using all the observations to do the regression, I need to pick only the first top 90% in terms of SIZE each year, so I'm currently making the subset as follows.

proc sort;
	by year descending size;
run;

data have;
	set have;
	by year;
	if first.year then accumulate=.;
	accumulate+size;
run;

proc sql;
	create table have as
	select *,
		0.9*max(accumulate) as hurdle,
		accumulate<=calculated hurdle as pick
	from have
	group by year
	order by year,size desc;
quit;

proc reg;
	where pick=1;
	model depen=indep;
run;

In other words, I

(1) downward sort by SIZE each year,

(2) ACCUMULATE by SIZE each year,

(3) set the 90% of the sum as HURDLE each year,

(4) PICK if ACCUMULATE up to the observation each year doesn't exceed HURDLE, and

(5) regress DEPEN on INDEP using the observations where PICK=1.

Here's the outcome.

1. Is there any other approach simpler and more experienced than this? I can understand these processes, but still wonder whether SAS provides some matching PROCs for the subsets.

2. I want to also include the observation that "touches" the HURDLE as

(a) this method picks no observation in Year 6 since the first observation already exceeds the HURDLE, and

(b) picks no observation in Year 7 since there's only one observation.

According to the result, I want to pick

(i) the last observations in Years 1, 2, 3, and 4 as they touch the HURDLEs,

(ii) the second last observation in Year 5 due to the same reason,

(iii) the first observation in Year 6 that alone exceeds the HURDLE, and

(iv) the only one observation in Year 7.

I much appreciate any comment from your experience.

Patrick · Posted 09-10-2019 10:27 PM

@Junyong

Below code returns the result you're showing us in your initial post.

options ps=max ls=max;
data have;
  infile datalines truncover dlm=' ';
  input year firm size indep depen;
  cards;
1 1 4907.98 514.14  515.08
1 2 5045.94 509.85  509.03
1 3 4819.58 505.49  506.34
1 4 4967.02 482.39  482.02
1 5 5094.18 506.08  505.62
1 6 5040.42 501.66  499.79
1 7 4987.62 491.48  489.79
2 1 5086.52 512.95  510.39
2 2 4895.88 503.15  503.28
2 3 5002.13 494.76  495.36
3 1 5078.44 506.06  508.51
3 2 4980.23 520.75  521.21
3 3 4948.08 484.90  485.79
3 4 5050.82 497.25  497.74
3 5 4964.70 498.54  497.97
4 1 4952.31 517.95  517.92
4 2 5027.91 501.90  502.15
4 3 4979.13 512.67  512.07
4 4 5005.56 499.99  501.16
4 5 4905.29 517.62  517.56
4 6 4848.28 479.76  481.32
5 1 1969.43 501.68  502.21
5 2 1931.39 497.89  497.67
5 3 9935.65 505.15  504.44
5 4 9995.49 510.52  511.26
5 5 9008.28 525.02  524.49
6 1 9981.95 518.30  517.78
6 2 15.49 506.13  506.81
7 1 5095.48 517.15  516.85
;
run;

proc sql;
  create table inter as
    select *, sum(size)*0.9 as hurdle
    from have
    group by year
    order by year, size DESC
    ;
quit;

data want;
  set inter;
  by year;
  if first.year then accumulate=size;
  else accumulate+size;
  pick= accumulate<=hurdle;
run;

View solution in original post

PeterClemmensen · Posted 09-10-2019 02:34 AM

You can create the subset like this

proc summary data=have nway;
   class year;
   var size;
   output out=temp(drop= _:) p10=/autoname;
run;

data want(drop=size_P10);
    merge have temp;
    by year;
    if size gt size_P10;
run;

.. And then do your regression without any Where Statement.

The DATA to DATA Step Macro
Blog: SASnrd

Junyong · Posted 09-10-2019 03:36 AM

Thanks, but I'm not picking the observations above the 10th percentile. Instead, I'm picking the observations that consist 90% of the sum from the biggest one. Here's what you recommended.

data _1;
	input size @@;
	i=1;
cards;
1 2 3 4 5 6 7 8 9 10 395
;
run;

proc summary nway;
	class i;
	var size;
	output out=_2(drop=_:) p10=/autoname;
run;

data _1;
	merge _1 _2;
	by i;
	if size>=size_p10;
run;

There are 11 observations. PROC SUMMARY in your code picks "2" as the 10th percentile. The code then removes the first observation as SIZE is lower than the 10th percentile. The resulting subset includes 10 observations from 395 to 2.

This is different from what I'm asking. In this case, the sum of all SIZEs is 450, and the 90% of this sum is 405. That is, I can include 90% of the SIZEs in my subset by just including only the first two observations (395 and 10) there—the remainder observations are negligible as they are only the lower 10% of the sample. The resulting subset must include the 2 observations—395 and 10.

Patrick · Posted 09-10-2019 04:29 AM

@Junyong

Is below returning what you're after?

options ps=max ls=max;
data have;
  infile datalines truncover dlm=' ';
  input year firm size indep depen;
  cards;
1 1 4907.98 514.14  515.08
1 2 5045.94 509.85  509.03
1 3 4819.58 505.49  506.34
1 4 4967.02 482.39  482.02
1 5 5094.18 506.08  505.62
1 6 5040.42 501.66  499.79
1 7 4987.62 491.48  489.79
2 1 5086.52 512.95  510.39
2 2 4895.88 503.15  503.28
2 3 5002.13 494.76  495.36
3 1 5078.44 506.06  508.51
3 2 4980.23 520.75  521.21
3 3 4948.08 484.90  485.79
3 4 5050.82 497.25  497.74
3 5 4964.70 498.54  497.97
4 1 4952.31 517.95  517.92
4 2 5027.91 501.90  502.15
4 3 4979.13 512.67  512.07
4 4 5005.56 499.99  501.16
4 5 4905.29 517.62  517.56
4 6 4848.28 479.76  481.32
5 1 1969.43 501.68  502.21
5 2 1931.39 497.89  497.67
5 3 9935.65 505.15  504.44
5 4 9995.49 510.52  511.26
5 5 9008.28 525.02  524.49
6 1 9981.95 518.30  517.78
6 2 15.49 506.13  506.81
7 1 5095.48 517.15  516.85
;
run;

proc sql;
  create table inter as
    select *, count(*) as _cntByYear
    from have
    group by year
    order by year, size DESC
    ;
quit;

data want(drop=_:);
  set inter;
  by year;
  if first.year then _cntCum=1;
  else _cntCum+1;
  if _cntCum/_cntByYear > 0.9 then delete;
run;

Junyong · Posted 09-10-2019 09:37 AM

Thanks again, but this also has the same issue—I'm picking not the first 90% observations, but the observations that occupies 90% of the sum each year. My first reply explains this in detail—what I need is not {395 10 9 8 7 6 5 4 3 2} (90% of all observations) but {395 10} (90% of the sum).

Patrick · Posted 09-10-2019 10:27 PM

@Junyong

Below code returns the result you're showing us in your initial post.

options ps=max ls=max;
data have;
  infile datalines truncover dlm=' ';
  input year firm size indep depen;
  cards;
1 1 4907.98 514.14  515.08
1 2 5045.94 509.85  509.03
1 3 4819.58 505.49  506.34
1 4 4967.02 482.39  482.02
1 5 5094.18 506.08  505.62
1 6 5040.42 501.66  499.79
1 7 4987.62 491.48  489.79
2 1 5086.52 512.95  510.39
2 2 4895.88 503.15  503.28
2 3 5002.13 494.76  495.36
3 1 5078.44 506.06  508.51
3 2 4980.23 520.75  521.21
3 3 4948.08 484.90  485.79
3 4 5050.82 497.25  497.74
3 5 4964.70 498.54  497.97
4 1 4952.31 517.95  517.92
4 2 5027.91 501.90  502.15
4 3 4979.13 512.67  512.07
4 4 5005.56 499.99  501.16
4 5 4905.29 517.62  517.56
4 6 4848.28 479.76  481.32
5 1 1969.43 501.68  502.21
5 2 1931.39 497.89  497.67
5 3 9935.65 505.15  504.44
5 4 9995.49 510.52  511.26
5 5 9008.28 525.02  524.49
6 1 9981.95 518.30  517.78
6 2 15.49 506.13  506.81
7 1 5095.48 517.15  516.85
;
run;

proc sql;
  create table inter as
    select *, sum(size)*0.9 as hurdle
    from have
    group by year
    order by year, size DESC
    ;
quit;

data want;
  set inter;
  by year;
  if first.year then accumulate=size;
  else accumulate+size;
  pick= accumulate<=hurdle;
run;

How to Subset with Only First Top 90% by Each Group?

Re: How to Subset with Only First Top 90% by Each Group?

Re: How to Subset with Only First Top 90% by Each Group?

Re: How to Subset with Only First Top 90% by Each Group?

Re: How to Subset with Only First Top 90% by Each Group?

Re: How to Subset with Only First Top 90% by Each Group?

Re: How to Subset with Only First Top 90% by Each Group?

SAS Innovate 2025: Call for Content

Classroom Training Available!