Hi friends,
I have two datasets ABC and DEF. They include weekly data about some companies during 2014-2018 (The companies listed in DEF are a subset of companies in ABC). ABC includes sales data and DEF includes some advertising expenditure data.
My problem is that whereas ABC does include all of the weeks of that 4-year period, DEF lacks some of the weeks (probably the company did not have any ad expenditure during that week and the builder of the dataset decided to ignore that week altogether, instead of dedicating a row to that week and listing ad expenditure as zero).
I want to fix this issue in my merged dataset. I want to build a merged file in such a way that it does include those problematic weeks (i.e. it should include the week and list the sales data, and include blank, or preferably zero, for ad expenditure data).
If I did not have this issue, I think the following code would give me the merged file. But how can I modify it to account for those problematic weeks? Thanks so much in advance!
PROC SQL;
create table FinalMergedDataset as SELECT *
FROM DEF, ABC
WHERE DEF.MergeName=ABC.brand_name and DEF.Year=ABC.year0 and DEF.GovWeek=ABC.week;
QUIT;
There is a comma missing after DEF.*
You can't have the same named column twice in a Select statement. SAS will not throw an error but it will just pick one of the same named columns.
The moment you need to modify a column you shouldn't use the * syntax anymore but you should list the columns one by one.
What can help avoid a lot of typing - run your SQL with options FEEDBACK NOEXEC
proc sql feedback noexec;
select c.*
from sashelp.class c
;
quit;
This will write the * syntax expanded to the SAS log
NOTE: Statement transforms to: select C.Name, C.Sex, C.Age, C.Height, C.Weight from SASHELP.CLASS C;
If there are many variables where I have to change something in the same way I normally copy/paste the variable list into Notepad++ and use RegEx search/replace like:
1. Each variable on its own line
->
2. coalesce() around variables
3. Remove the alias from the new variables
...and now just copy/paste the result back into your SAS SQL code
Above was for demonstration and you need some RegEx experience to use this approach efficiently - but you can of course also use it for partial changes and then do the rest manually.
Alternatively for your actual problem: Use post processing like a SAS data step with array processing where you set missing values to zero.
....and should you only need the zero's for reporting/printing then also consider to just set options missing='0'. This will print missings as zero in reports.
Instead of requesting the intersection of ABC and DEF, do a left join:
PROC SQL;
create table FinalMergedDataset as SELECT *
FROM ABC left join DEF
ON DEF.MergeName=ABC.brand_name and DEF.Year=ABC.year0 and DEF.GovWeek=ABC.week;
QUIT;
Any instance of ABC not matched in DEF will generate missing values for the def-specific variables.
Now if you need zeroes instead of missing, you need to know the def variables names. Say they are AD1 and AD2:
PROC SQL;
create table FinalMergedDataset as
SELECT ABC.*,
coalesce(def.a1,0) as a1,
coalesce(def.a2,0) as a2
FROM ABC left join DEF
ON DEF.MergeName=ABC.brand_name and DEF.Year=ABC.year0 and DEF.GovWeek=ABC.week;
QUIT;
@mkeintz Thanks so much. I was a bit confused about one aspect of your code. Let me use an example:
Let's say the ABC dataset (sales data) that I mentioned, includes three companies: A, B, C.
The DEF dataset (ad expenditure data) includes two companies: A, B (as I said earlier, the companies in DEF are a subset of those in ABC).
Both datasets include data on 100 weeks. For company A, ad expenditure and sales data for all 100 weeks are available in the datasets. For company B, 100 weeks of sales data is available in ABC, but only, say, 75 weeks of ad expenditure data is available in DEF (as I said, maybe company B didn't spend anything on advertising during the other 25 weeks and the dataset builder just ignored those weeks, instead of including those weeks and assigning a zero value).
What I want to have in my merged dataset is:
--100 rows for company A, which would include sales data for 100 weeks and ad expenditure data for 100 weeks.
--100 rows for company B, which would include sales data for 100 weeks and ad expenditure data for 75 weeks and assign zero to ad expenditure for the 25 weeks missing in the DEF dataset.
--0 rows for company C.
Do the codes that you kindly provided give such an output?
Thanks again!
@AlG wrote:
@mkeintz Thanks so much. I was a bit confused about one aspect of your code. Let me use an example:
Let's say the ABC dataset (sales data) that I mentioned, includes three companies: A, B, C.
The DEF dataset (ad expenditure data) includes two companies: A, B (as I said earlier, the companies in DEF are a subset of those in ABC).
Both datasets include data on 100 weeks. For company A, ad expenditure and sales data for all 100 weeks are available in the datasets. For company B, 100 weeks of sales data is available in ABC, but only, say, 75 weeks of ad expenditure data is available in DEF (as I said, maybe company B didn't spend anything on advertising during the other 25 weeks and the dataset builder just ignored those weeks, instead of including those weeks and assigning a zero value).
What I want to have in my merged dataset is:
--100 rows for company A, which would include sales data for 100 weeks and ad expenditure data for 100 weeks.
--100 rows for company B, which would include sales data for 100 weeks and ad expenditure data for 75 weeks and assign zero to ad expenditure for the 25 weeks missing in the DEF dataset.
--0 rows for company C.
Do the codes that you kindly provided give such an output?
Thanks again!
The answer is best determined by running the code. Make a dataset of, say 10 weeks total for 2 companies, with all 10 weeks reported in the sales (say 3 variables) data for each company, and a subset of weeks reported in the ad expenditures, with a couple of variables. Run the code, look at the results. This is half the fun of programming - testing code, then going back if necessary to fix or improve it. Welcome to the club.
@mkeintz Thanks! I tried the following code. The issue that I am facing is that for non-matching observations the value for for week_volume is still missing.
PROC SQL;
create table Want as
SELECT ABC.*, DEF.*
COALESCE(DEF.Week_Volume,0) as Week_Volume,
FROM ABC left join DEF
ON DEF.MergeName=ABC.brand_name and DEF.Year=ABC.year0 and DEF.GovWeek=ABC.week;
QUIT;
However, when I create a new variable (instead of replacing the missing values in the original variable), it seems to give me what I want (i.e. zero for non-matching observations):
PROC SQL;
create table Want as
SELECT ABC.*, DEF.*
COALESCE(DEF.Week_Volume,0) as Week_Volume2,
FROM ABC left join DEF
ON DEF.MergeName=ABC.brand_name and DEF.Year=ABC.year0 and DEF.GovWeek=ABC.week;
QUIT;
Given that I want to repeat the same thing for a lot of other variables too, it is really tedious to create new variables for each and then drop the original variables. Can you please tell me what the issue is with my first code above? Thanks again!
There is a comma missing after DEF.*
You can't have the same named column twice in a Select statement. SAS will not throw an error but it will just pick one of the same named columns.
The moment you need to modify a column you shouldn't use the * syntax anymore but you should list the columns one by one.
What can help avoid a lot of typing - run your SQL with options FEEDBACK NOEXEC
proc sql feedback noexec;
select c.*
from sashelp.class c
;
quit;
This will write the * syntax expanded to the SAS log
NOTE: Statement transforms to: select C.Name, C.Sex, C.Age, C.Height, C.Weight from SASHELP.CLASS C;
If there are many variables where I have to change something in the same way I normally copy/paste the variable list into Notepad++ and use RegEx search/replace like:
1. Each variable on its own line
->
2. coalesce() around variables
3. Remove the alias from the new variables
...and now just copy/paste the result back into your SAS SQL code
Above was for demonstration and you need some RegEx experience to use this approach efficiently - but you can of course also use it for partial changes and then do the rest manually.
Alternatively for your actual problem: Use post processing like a SAS data step with array processing where you set missing values to zero.
....and should you only need the zero's for reporting/printing then also consider to just set options missing='0'. This will print missings as zero in reports.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.