Solved: Save number of observations of so many files efficiently

daradanye · Posted 08-21-2019 03:20 PM

Hi guys,

I have the following question,

I have datasets from a2007 to a2018 (12 years). For each year, the data looks like this:

Firm	Plant
a	1
a	2
a	3
b	1
b	2
c	1
c	2
c	3
c	4

I want to count the unique firms and the unique firm-plant combinations in each year and then collect the yearly counts in one table. What I would like is something like this:

Year	# of firms	# of plants
2007	3	9
…
2018	4	12

I know I can use proc means with an output statement to generate a dataset for each year. Is there a more efficient way? By the way, since the dataset for each year is quite large, I tried to append all the years together, but it is very slow and bogs down my PC.

I will appreciate it very much if someone can help out here. Thanks!

r_behata · Posted 08-21-2019 04:00 PM

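data combine;
	set a2007-a2018 indsname=source;
	/* source holds the name of the current input data set, e.g. WORK.A2007; keep only the digits */
	year=compress(source,,'kd');
	/* every row contributes 1, so both sums below equal the row count per year */
	no_firms=1;
	no_plants=1;
run;

proc means data=combine noprint nway;
	var no_firms no_plants;
	class year;
	output out=want(drop=_:) sum=;
run;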


Reeza · Posted 08-21-2019 03:28 PM

Create a view that appends all the data sets together and run a single proc means on that data set.

daradanye · Posted 08-21-2019 03:34 PM

Hi Reeza,

Thank you. Is there an alternative that does not require appending the datasets first? The combined dataset is too large. It takes a lot of time to append and then run the aggregation. Thanks!

Reeza · Posted 08-21-2019 03:44 PM

That's why I said to create it as a VIEW, not a data set.

Did you try a VIEW? Then only proc means will be processing the whole data set.

data combined / view=combined;
	set a2012-a2017;
run;

proc means data=combined;
	class year;
	var ....
run;

andreas_lds · Posted 08-22-2019 01:46 AM

@daradanye wrote:

Hi Reeza,

Thank you. Is there an alternative that does not require appending the datasets first? The combined dataset is too large. It takes a lot of time to append and then run the aggregation. Thanks!

Too large? How many observations do you have per dataset? Can you post the proc means you are using? Running proc means on each dataset separately and appending the results could solve the performance issue.
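
A rough sketch of the per-dataset idea, for illustration only. It swaps proc means for proc sql so that distinct firms can be counted directly, and it rests on several assumptions: the macro name count_by_year and the table names _one and want are made up here, the variables are called firm and plant as in the question, and each row is taken to be one unique firm-plant combination so that count(*) can stand in for the number of plants.

```sas
%macro count_by_year(first=2007, last=2018);
    %local yr;

    /* Start clean so that a rerun does not append the same years twice */
    proc datasets lib=work nolist nowarn;
        delete want;
    quit;

    %do yr = &first %to &last;
        /* One summary row per yearly data set */
        proc sql;
            create table _one as
            select &yr as year,
                   count(distinct firm) as no_firms,
                   count(*) as no_plants   /* assumes one row per firm-plant pair */
            from a&yr;
        quit;

        /* Only the one-row summary is appended, never the raw data */
        proc append base=want data=_one;
        run;
    %end;
%mend count_by_year;

%count_by_year()
```

Each pass reads one yearly file and keeps a single row, so the large datasets are never stacked on disk.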

KachiM · Posted 08-22-2019 12:48 PM

@daradanye

Have you gotten the answer you want? If not, describe the issues you face.

KachiM · Posted 08-22-2019 01:02 PM

@daradanye

The solution given by @r_behata gives the same number for no_firms and no_plants when I tested it. The number of distinct firms within a data set should be less than or equal to the number of observations. Have I missed anything here?
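
If distinct counts are what is needed, one possible approach is to combine the view idea from earlier in the thread with count(distinct ...) in proc sql. This is only a sketch under assumptions: the variables are named firm and plant as in the question, the yearly data sets are a2007 to a2018, and the names combined and want are placeholders.

```sas
/* A view stores no data; the yearly files are read only when proc sql runs */
data combined / view=combined;
    set a2007-a2018 indsname=source;
    /* source looks like WORK.A2007: take the member name, keep its digits, make it numeric */
    year = input(compress(scan(source, -1, '.'), , 'kd'), 4.);
run;

proc sql;
    create table want as
    select year,
           count(distinct firm) as no_firms,
           /* a plant is identified by its firm-plant pair */
           count(distinct catx('|', firm, plant)) as no_plants
    from combined
    group by year;
quit;
```

With the sample data in the question this would give 3 firms and 9 plants for that year, since plant numbers repeat across firms and have to be counted per firm.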
