Transforming an unbalanced dataset into a balanced one

AlG — Sun, 21 Jun 2020 02:51:08 GMT

Hi Everyone,

I have a dataset which includes weekly information about several companies from 2014 to 2018 (Identifier variables are: CompanyID, Year, Week).

This dataset is unbalanced, in the sense that for some of the companies there is no observation (and hence, no row) for some of the weeks of some of the years. I need to transform this dataset into a balanced one. In other words, I need to drop the year/week observations that do not exist for all of the firms. How can I do this in SAS?

Thanks so much in advance.

Re: Transforming an unbalanced dataset into a balanced one

ed_sas_member — Sun, 21 Jun 2020 07:55:59 GMT

Hi @AlG

Here is one approach, using PROC FREQ with the SPARSE option, which reports all possible combinations of the variable values, even those that do not occur in the data.

data have;
	input CompanyID $ Year Week;
	datalines;
A 2014 1
A 2014 2
A 2014 3
A 2015 1
A 2015 2
A 2015 3
A 2017 1
A 2017 2
B 2014 1
B 2014 3
B 2016 3
B 2017 1
B 2017 2
;

/* Retrieve the list of Year-Week pairs for which at least one company has no observation */
proc freq data=have noprint;
	table CompanyID*Year*Week / sparse out=have_freq (drop=percent);
run;

proc sort data=have_freq (where=(count=0) drop=CompanyID) out=list_couples (drop=count) nodupkey;
	by Year Week;
run;

/* Create table want */
proc sql;
	create table want as
	select a.*
	from have as a 
		 inner join
		 (select Year, Week from have
		 except
		 select Year, Week from list_couples) as b
	on a.Year=b.Year and a.Week = b.Week
	order by CompanyID, Year, Week;
quit;

Best,

Re: Transforming an unbalanced dataset into a balanced one

Patrick — Sun, 21 Jun 2020 08:22:35 GMT

If you have SAS/ETS licensed, you could use PROC EXPAND to generate the missing data points instead of throwing away actual data points.

Below is one way to get what you've asked for.

data have;
	input CompanyID $ Year Week;
	datalines;
A 2014 1
A 2014 2
A 2014 3
A 2015 1
A 2015 2
A 2015 3
A 2017 1
A 2017 2
B 2014 1
B 2014 3
B 2016 3
B 2017 1
B 2017 2
;

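/* Caveat: this relies on at least one Year/Week combination being present  */
/* for every company, and on no duplicate CompanyID/Year/Week rows in have. */
proc sql;
  create table want as
  select
    CompanyID,
    Year,
    Week
  from
  (
    /* Number of rows per Year/Week, remerged onto every detail row */
    select 
      *, 
      count(*) as n_obs_perYearWeek
    from have
    group by Year,Week
  )
  /* Per company, keep only the rows whose Year/Week count equals the company's maximum */
  group by CompanyID
  having max(n_obs_perYearWeek)=n_obs_perYearWeek
  order by CompanyID, year, week
  ;
quit;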
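
For the alternative mentioned at the top of this post (filling in the missing points rather than dropping rows), here is a minimal sketch that does not use PROC EXPAND. It is a plain PROC SQL substitute, not the SAS/ETS method: it builds every combination of CompanyID and Year/Week seen in have and flags which ones were actually observed. The names grid, want_filled and observed are made up for illustration.

```sas
proc sql;
  /* Every company crossed with every Year/Week pair that occurs in have */
  create table grid as
  select c.CompanyID, p.Year, p.Week
  from (select distinct CompanyID from have) as c,
       (select distinct Year, Week from have) as p;

  /* observed = 1 when the row exists in have, 0 when it was added by the grid */
  create table want_filled as
  select g.CompanyID, g.Year, g.Week,
         (h.CompanyID is not null) as observed
  from grid as g
       left join have as h
       on g.CompanyID = h.CompanyID and g.Year = h.Year and g.Week = h.Week
  order by g.CompanyID, g.Year, g.Week;
quit;
```

With the sample data above, have contains 2 companies and 9 distinct Year/Week pairs, so want_filled has 18 rows, 13 of them with observed=1. Note that the grid only covers Year/Week pairs that occur for at least one company; weeks missing for every company are not generated.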

Re: Transforming an unbalanced dataset into a balanced one

PGStats — Sun, 21 Jun 2020 22:20:50 GMT

This query does it:

proc sql;
create table want as
select 
    a.*
from 
    have as a inner join
    (select year, week from have group by year, week having count(distinct companyID) = 
        (select count(distinct companyID) from have)) as b
            on a.year=b.year and a.week=b.week;
quit;
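
Whichever query is used, a quick sanity check on the result can help. The sketch below assumes the output table is named want, as in the posts above, and lists any Year/Week combination in want that does not cover every company found in have. The column name n_companies is just illustrative. If want is balanced, the query should return no rows.

```sas
proc sql;
  /* List Year/Week pairs in want that are missing at least one company */
  select Year, Week, count(distinct CompanyID) as n_companies
  from want
  group by Year, Week
  having count(distinct CompanyID) ne
         (select count(distinct CompanyID) from have);
quit;
```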