Solved: Transforming an unbalanced dataset into a balanced one

AlG · Posted 06-20-2020 10:51 PM

Hi Everyone,

I have a dataset which includes weekly information about several companies from 2014 to 2018 (Identifier variables are: CompanyID, Year, Week).

This dataset is unbalanced, in the sense that for some of the companies there is no observation (and hence, no row) for some of the weeks of some of the years. I need to transform this dataset into a balanced one. In other words, I need to drop the year/week observations that do not exist for all of the firms. How can I do this in SAS?

Thanks so much in advance.

PGStats · Posted 06-21-2020 06:20 PM

This query does it:

proc sql;
create table want as
select 
    a.*
from 
    have as a inner join
    (select year, week from have group by year, week having count(distinct companyID) = 
        (select count(distinct companyID) from have)) as b
            on a.year=b.year and a.week=b.week;
quit;

PG

View solution in original post

ed_sas_member · Posted 06-21-2020 03:55 AM

Hi @AlG

Here is an approach to achieve this, using a PROC FREQ and the SPARSE option to report all possible combinations of the variable values, even if a combination does not occur in the data.

data have;
	input CompanyID $ Year Week;
	datalines;
A 2014 1
A 2014 2
A 2014 3
A 2015 1
A 2015 2
A 2015 3
A 2017 1
A 2017 2
B 2014 1
B 2014 3
B 2016 3
B 2017 1
B 2017 2
;

/* Retrieve list of couples Year-Week where at list a company as no observation */
proc freq data=have noprint;
	table CompanyID*Year*Week / sparse out=have_freq (drop=percent);
run;

proc sort data=have_freq (where=(count=0) drop=CompanyID) out=list_couples (drop=count) nodupkey;
	by Year Week;
run;

/* Create table want */
proc sql;
	create table want as
	select a.*
	from have as a 
		 right join
		 (select Year, Week from have
		 except
		 select Year, Week from list_couples) as b
	on a.Year=b.Year and a.Week = b.Week
	order by CompanyID, Year, Week;
quit;

Best,

Patrick · Posted 06-21-2020 04:08 AM

If you've got SAS/ETS licensed then you could use Proc Expand and generate the missing data points instead of throwing away actual data points.

Below one way to get what you've asked for.

data have;
	input CompanyID $ Year Week;
	datalines;
A 2014 1
A 2014 2
A 2014 3
A 2015 1
A 2015 2
A 2015 3
A 2017 1
A 2017 2
B 2014 1
B 2014 3
B 2016 3
B 2017 1
B 2017 2
;

proc sql;
  create table want as
  select
    CompanyID,
    Year,
    Week
  from
  (
    select 
      *, 
      count(*) as n_obs_perYearWeek
    from have
    group by Year,Week
  )
  group by CompanyID
  having max(n_obs_perYearWeek)=n_obs_perYearWeek
  order by CompanyID, year, week
  ;
quit;

PGStats · Posted 06-21-2020 06:20 PM

This query does it:

proc sql;
create table want as
select 
    a.*
from 
    have as a inner join
    (select year, week from have group by year, week having count(distinct companyID) = 
        (select count(distinct companyID) from have)) as b
            on a.year=b.year and a.week=b.week;
quit;

PG

Transforming an unbalanced dataset into a balanced one

Re: Transforming an unbalanced dataset into a balanced one

Re: Transforming an unbalanced dataset into a balanced one

Re: Transforming an unbalanced dataset into a balanced one

Re: Transforming an unbalanced dataset into a balanced one

Transforming an unbalanced dataset into a balanced one

Re: Transforming an unbalanced dataset into a balanced one

Re: Transforming an unbalanced dataset into a balanced one

Re: Transforming an unbalanced dataset into a balanced one

Re: Transforming an unbalanced dataset into a balanced one

The 2025 SAS Hackathon has begun!