I'm a bit unsure if the title describes the problem correctly, but I have the following data set.
data have;
	input year firm size indep depen;
cards;
1	1	4907.98	514.14	515.08
1	2	5045.94	509.85	509.03
1	3	4819.58	505.49	506.34
1	4	4967.02	482.39	482.02
1	5	5094.18	506.08	505.62
1	6	5040.42	501.66	499.79
1	7	4987.62	491.48	489.79
2	1	5086.52	512.95	510.39
2	2	4895.88	503.15	503.28
2	3	5002.13	494.76	495.36
3	1	5078.44	506.06	508.51
3	2	4980.23	520.75	521.21
3	3	4948.08	484.90	485.79
3	4	5050.82	497.25	497.74
3	5	4964.70	498.54	497.97
4	1	4952.31	517.95	517.92
4	2	5027.91	501.90	502.15
4	3	4979.13	512.67	512.07
4	4	5005.56	499.99	501.16
4	5	4905.29	517.62	517.56
4	6	4848.28	479.76	481.32
5	1	1969.43	501.68	502.21
5	2	1931.39	497.89	497.67
5	3	9935.65	505.15	504.44
5	4	9995.49	510.52	511.26
5	5	9008.28	525.02	524.49
6	1	9981.95	518.30	517.78
6	2	15.49	506.13	506.81
7	1	5095.48	517.15	516.85
;
run;Where YEAR, FIRM, SIZE, INDEP, and DEPEN are the time index, the individual index, the market capitalization, the independent variable, and the dependent variable, correspondingly. Instead of using all the observations to do the regression, I need to pick only the first top 90% in terms of SIZE each year, so I'm currently making the subset as follows.
proc sort;
	by year descending size;
run;
data have;
	set have;
	by year;
	if first.year then accumulate=.;
	accumulate+size;
run;
proc sql;
	create table have as
	select *,
		0.9*max(accumulate) as hurdle,
		accumulate<=calculated hurdle as pick
	from have
	group by year
	order by year,size desc;
quit;
proc reg;
	where pick=1;
	model depen=indep;
run;In other words, I
(1) downward sort by SIZE each year,
(2) ACCUMULATE by SIZE each year,
(3) set the 90% of the sum as HURDLE each year,
(4) PICK if ACCUMULATE up to the observation each year doesn't exceed HURDLE, and
(5) regress DEPEN on INDEP using the observations where PICK=1.
Here's the outcome.
1. Is there any other approach simpler and more experienced than this? I can understand these processes, but still wonder whether SAS provides some matching PROCs for the subsets.
2. I want to also include the observation that "touches" the HURDLE as
(a) this method picks no observation in Year 6 since the first observation already exceeds the HURDLE, and
(b) picks no observation in Year 7 since there's only one observation.
According to the result, I want to pick
(i) the last observations in Years 1, 2, 3, and 4 as they touch the HURDLEs,
(ii) the second last observation in Year 5 due to the same reason,
(iii) the first observation in Year 6 that alone exceeds the HURDLE, and
(iv) the only one observation in Year 7.
I much appreciate any comment from your experience.
Below code returns the result you're showing us in your initial post.
options ps=max ls=max;
data have;
  infile datalines truncover dlm=' ';
  input year firm size indep depen;
  cards;
1 1 4907.98 514.14  515.08
1 2 5045.94 509.85  509.03
1 3 4819.58 505.49  506.34
1 4 4967.02 482.39  482.02
1 5 5094.18 506.08  505.62
1 6 5040.42 501.66  499.79
1 7 4987.62 491.48  489.79
2 1 5086.52 512.95  510.39
2 2 4895.88 503.15  503.28
2 3 5002.13 494.76  495.36
3 1 5078.44 506.06  508.51
3 2 4980.23 520.75  521.21
3 3 4948.08 484.90  485.79
3 4 5050.82 497.25  497.74
3 5 4964.70 498.54  497.97
4 1 4952.31 517.95  517.92
4 2 5027.91 501.90  502.15
4 3 4979.13 512.67  512.07
4 4 5005.56 499.99  501.16
4 5 4905.29 517.62  517.56
4 6 4848.28 479.76  481.32
5 1 1969.43 501.68  502.21
5 2 1931.39 497.89  497.67
5 3 9935.65 505.15  504.44
5 4 9995.49 510.52  511.26
5 5 9008.28 525.02  524.49
6 1 9981.95 518.30  517.78
6 2 15.49 506.13  506.81
7 1 5095.48 517.15  516.85
;
run;
proc sql;
  create table inter as
    select *, sum(size)*0.9 as hurdle
    from have
    group by year
    order by year, size DESC
    ;
quit;
data want;
  set inter;
  by year;
  if first.year then accumulate=size;
  else accumulate+size;
  pick= accumulate<=hurdle;
run;You can create the subset like this
proc summary data=have nway;
   class year;
   var size;
   output out=temp(drop= _:) p10=/autoname;
run;
data want(drop=size_P10);
    merge have temp;
    by year;
    if size gt size_P10;
run;
.. And then do your regression without any Where Statement.
Thanks, but I'm not picking the observations above the 10th percentile. Instead, I'm picking the observations that consist 90% of the sum from the biggest one. Here's what you recommended.
data _1;
	input size @@;
	i=1;
cards;
1 2 3 4 5 6 7 8 9 10 395
;
run;
proc summary nway;
	class i;
	var size;
	output out=_2(drop=_:) p10=/autoname;
run;
data _1;
	merge _1 _2;
	by i;
	if size>=size_p10;
run;There are 11 observations. PROC SUMMARY in your code picks "2" as the 10th percentile. The code then removes the first observation as SIZE is lower than the 10th percentile. The resulting subset includes 10 observations from 395 to 2.
This is different from what I'm asking. In this case, the sum of all SIZEs is 450, and the 90% of this sum is 405. That is, I can include 90% of the SIZEs in my subset by just including only the first two observations (395 and 10) there—the remainder observations are negligible as they are only the lower 10% of the sample. The resulting subset must include the 2 observations—395 and 10.
Is below returning what you're after?
options ps=max ls=max;
data have;
  infile datalines truncover dlm=' ';
  input year firm size indep depen;
  cards;
1 1 4907.98 514.14  515.08
1 2 5045.94 509.85  509.03
1 3 4819.58 505.49  506.34
1 4 4967.02 482.39  482.02
1 5 5094.18 506.08  505.62
1 6 5040.42 501.66  499.79
1 7 4987.62 491.48  489.79
2 1 5086.52 512.95  510.39
2 2 4895.88 503.15  503.28
2 3 5002.13 494.76  495.36
3 1 5078.44 506.06  508.51
3 2 4980.23 520.75  521.21
3 3 4948.08 484.90  485.79
3 4 5050.82 497.25  497.74
3 5 4964.70 498.54  497.97
4 1 4952.31 517.95  517.92
4 2 5027.91 501.90  502.15
4 3 4979.13 512.67  512.07
4 4 5005.56 499.99  501.16
4 5 4905.29 517.62  517.56
4 6 4848.28 479.76  481.32
5 1 1969.43 501.68  502.21
5 2 1931.39 497.89  497.67
5 3 9935.65 505.15  504.44
5 4 9995.49 510.52  511.26
5 5 9008.28 525.02  524.49
6 1 9981.95 518.30  517.78
6 2 15.49 506.13  506.81
7 1 5095.48 517.15  516.85
;
run;
proc sql;
  create table inter as
    select *, count(*) as _cntByYear
    from have
    group by year
    order by year, size DESC
    ;
quit;
data want(drop=_:);
  set inter;
  by year;
  if first.year then _cntCum=1;
  else _cntCum+1;
  if _cntCum/_cntByYear > 0.9 then delete;
run;
Below code returns the result you're showing us in your initial post.
options ps=max ls=max;
data have;
  infile datalines truncover dlm=' ';
  input year firm size indep depen;
  cards;
1 1 4907.98 514.14  515.08
1 2 5045.94 509.85  509.03
1 3 4819.58 505.49  506.34
1 4 4967.02 482.39  482.02
1 5 5094.18 506.08  505.62
1 6 5040.42 501.66  499.79
1 7 4987.62 491.48  489.79
2 1 5086.52 512.95  510.39
2 2 4895.88 503.15  503.28
2 3 5002.13 494.76  495.36
3 1 5078.44 506.06  508.51
3 2 4980.23 520.75  521.21
3 3 4948.08 484.90  485.79
3 4 5050.82 497.25  497.74
3 5 4964.70 498.54  497.97
4 1 4952.31 517.95  517.92
4 2 5027.91 501.90  502.15
4 3 4979.13 512.67  512.07
4 4 5005.56 499.99  501.16
4 5 4905.29 517.62  517.56
4 6 4848.28 479.76  481.32
5 1 1969.43 501.68  502.21
5 2 1931.39 497.89  497.67
5 3 9935.65 505.15  504.44
5 4 9995.49 510.52  511.26
5 5 9008.28 525.02  524.49
6 1 9981.95 518.30  517.78
6 2 15.49 506.13  506.81
7 1 5095.48 517.15  516.85
;
run;
proc sql;
  create table inter as
    select *, sum(size)*0.9 as hurdle
    from have
    group by year
    order by year, size DESC
    ;
quit;
data want;
  set inter;
  by year;
  if first.year then accumulate=size;
  else accumulate+size;
  pick= accumulate<=hurdle;
run;It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
