Solved: Merging two datasets with missing values

AlG · Posted 05-30-2020 02:37 PM

Hi friends,

I have two datasets ABC and DEF. They include weekly data about some companies during 2014-2018 (The companies listed in DEF are a subset of companies in ABC). ABC includes sales data and DEF includes some advertising expenditure data.

My problem is that whereas ABC does include all of the weeks of that 4-year period, DEF lacks some of the weeks (probably the company did not have any ad expenditure during that week and the builder of the dataset decided to ignore that week altogether, instead of dedicating a row to that week and listing ad expenditure as zero).

I want to fix this issue in my merged dataset. I want to build a merged file in such a way that it does include those problematic weeks (i.e. it should include the week and list the sales data, and include blank, or preferably zero, for ad expenditure data).

If I did not have this issue, I think the following code would give me the merged file. But how can I modify it to account for those problematic weeks? Thanks so much in advance!

PROC SQL;
create table FinalMergedDataset as SELECT *
 FROM DEF, ABC
 WHERE DEF.MergeName=ABC.brand_name and DEF.Year=ABC.year0 and DEF.GovWeek=ABC.week;
QUIT;

Patrick · Posted 06-13-2020 09:57 PM

There is a comma missing after DEF.*

You can't have the same named column twice in a Select statement. SAS will not throw an error but it will just pick one of the same named columns.

The moment you need to modify a column you shouldn't use the * syntax anymore but you should list the columns one by one.

What can help avoid a lot of typing - run your SQL with options FEEDBACK NOEXEC

proc sql feedback noexec;
  select c.*
  from sashelp.class c
  ;
quit;

This will write the * syntax expanded to the SAS log

NOTE: Statement transforms to:

        select C.Name, C.Sex, C.Age, C.Height, C.Weight
          from SASHELP.CLASS C;

If there are many variables where I have to change something in the same way I normally copy/paste the variable list into Notepad++ and use RegEx search/replace like:

1. Each variable on its own line

->

2. coalesce() around variables

3. Remove the alias from the new variables

...and now just copy/paste the result back into your SAS SQL code

Above was for demonstration and you need some RegEx experience to use this approach efficiently - but you can of course also use it for partial changes and then do the rest manually.

Alternatively for your actual problem: Use post processing like a SAS data step with array processing where you set missing values to zero.

....and should you only need the zero's for reporting/printing then also consider to just set options missing='0'. This will print missings as zero in reports.

View solution in original post

mkeintz · Posted 05-30-2020 10:56 PM

Instead of requesting the intersection of ABC and DEF, do a left join:

PROC SQL;
create table FinalMergedDataset as SELECT *
 FROM ABC  left join DEF
 ON DEF.MergeName=ABC.brand_name and DEF.Year=ABC.year0 and DEF.GovWeek=ABC.week;
QUIT;

Any instance of ABC not matched in DEF will generate missing values for the def-specific variables.

Now if you need zeroes instead of missing, you need to know the def variables names. Say they are AD1 and AD2:

PROC SQL;
create table FinalMergedDataset as
   SELECT ABC.*,  
     coalesce(def.a1,0) as a1,
     coalesce(def.a2,0) as a2
 FROM ABC left join DEF
 ON DEF.MergeName=ABC.brand_name and DEF.Year=ABC.year0 and DEF.GovWeek=ABC.week;
QUIT;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

AlG · Posted 05-31-2020 02:44 PM

@mkeintz Thanks so much. I was a bit confused about one aspect of your code. Let me use an example:

Let's say the ABC dataset (sales data) that I mentioned, includes three companies: A, B, C.

The DEF dataset (ad expenditure data) includes two companies: A, B (as I said earlier, the companies in DEF are a subset of those in ABC).

Both datasets include data on 100 weeks. For company A, ad expenditure and sales data for all 100 weeks are available in the datasets. For company B, 100 weeks of sales data is available in ABC, but only, say, 75 weeks of ad expenditure data is available in DEF (as I said, maybe company B didn't spend anything on advertising during the other 25 weeks and the dataset builder just ignored those weeks, instead of including those weeks and assigning a zero value).

What I want to have in my merged dataset is:

--100 rows for company A, which would include sales data for 100 weeks and ad expenditure data for 100 weeks.

--100 rows for company B, which would include sales data for 100 weeks and ad expenditure data for 75 weeks and assign zero to ad expenditure for the 25 weeks missing in the DEF dataset.

--0 rows for company C.

Do the codes that you kindly provided give such an output?

Thanks again!

mkeintz · Posted 06-01-2020 04:50 PM

@AlG wrote:

@mkeintz Thanks so much. I was a bit confused about one aspect of your code. Let me use an example:

Let's say the ABC dataset (sales data) that I mentioned, includes three companies: A, B, C.

The DEF dataset (ad expenditure data) includes two companies: A, B (as I said earlier, the companies in DEF are a subset of those in ABC).

Both datasets include data on 100 weeks. For company A, ad expenditure and sales data for all 100 weeks are available in the datasets. For company B, 100 weeks of sales data is available in ABC, but only, say, 75 weeks of ad expenditure data is available in DEF (as I said, maybe company B didn't spend anything on advertising during the other 25 weeks and the dataset builder just ignored those weeks, instead of including those weeks and assigning a zero value).

What I want to have in my merged dataset is:

--100 rows for company A, which would include sales data for 100 weeks and ad expenditure data for 100 weeks.

--100 rows for company B, which would include sales data for 100 weeks and ad expenditure data for 75 weeks and assign zero to ad expenditure for the 25 weeks missing in the DEF dataset.

--0 rows for company C.

Do the codes that you kindly provided give such an output?

Thanks again!

The answer is best determined by running the code. Make a dataset of, say 10 weeks total for 2 companies, with all 10 weeks reported in the sales (say 3 variables) data for each company, and a subset of weeks reported in the ad expenditures, with a couple of variables. Run the code, look at the results. This is half the fun of programming - testing code, then going back if necessary to fix or improve it. Welcome to the club.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

AlG · Posted 06-13-2020 12:42 PM

@mkeintz Thanks! I tried the following code. The issue that I am facing is that for non-matching observations the value for for week_volume is still missing.

PROC SQL;
create table Want as
   SELECT ABC.*, DEF.* 
     COALESCE(DEF.Week_Volume,0) as Week_Volume,
 FROM ABC left join DEF
 ON DEF.MergeName=ABC.brand_name and DEF.Year=ABC.year0 and DEF.GovWeek=ABC.week;
QUIT;

However, when I create a new variable (instead of replacing the missing values in the original variable), it seems to give me what I want (i.e. zero for non-matching observations):

PROC SQL;
create table Want as
   SELECT ABC.*, DEF.* 
     COALESCE(DEF.Week_Volume,0) as Week_Volume2,
 FROM ABC left join DEF
 ON DEF.MergeName=ABC.brand_name and DEF.Year=ABC.year0 and DEF.GovWeek=ABC.week;
QUIT;

Given that I want to repeat the same thing for a lot of other variables too, it is really tedious to create new variables for each and then drop the original variables. Can you please tell me what the issue is with my first code above? Thanks again!

Patrick · Posted 06-13-2020 09:57 PM

There is a comma missing after DEF.*

You can't have the same named column twice in a Select statement. SAS will not throw an error but it will just pick one of the same named columns.

The moment you need to modify a column you shouldn't use the * syntax anymore but you should list the columns one by one.

What can help avoid a lot of typing - run your SQL with options FEEDBACK NOEXEC

proc sql feedback noexec;
  select c.*
  from sashelp.class c
  ;
quit;

This will write the * syntax expanded to the SAS log

NOTE: Statement transforms to:

        select C.Name, C.Sex, C.Age, C.Height, C.Weight
          from SASHELP.CLASS C;

If there are many variables where I have to change something in the same way I normally copy/paste the variable list into Notepad++ and use RegEx search/replace like:

1. Each variable on its own line

->

2. coalesce() around variables

3. Remove the alias from the new variables

...and now just copy/paste the result back into your SAS SQL code

Above was for demonstration and you need some RegEx experience to use this approach efficiently - but you can of course also use it for partial changes and then do the rest manually.

Alternatively for your actual problem: Use post processing like a SAS data step with array processing where you set missing values to zero.

....and should you only need the zero's for reporting/printing then also consider to just set options missing='0'. This will print missings as zero in reports.

AlG · Posted 06-14-2020 01:30 AM

@Patrick @mkeintz

Thank you so much for your help!

Merging two datasets with missing values

Re: Merging two datasets with missing values

Re: Merging two datasets with missing values

Re: Merging two datasets with missing values

Re: Merging two datasets with missing values

Re: Merging two datasets with missing values

Re: Merging two datasets with missing values

Re: Merging two datasets with missing values

SAS Innovate 2025: Save the Date