Hey everyone
I'd like to automate a process I usually do in excel.
Here is my question: after a lot of basic calculations (I use PROC SQL to write my code) I come up with 5 different tables (one for each year: 2018-2022) that list around 150 different products and some economic stats about them. Now I would need to come up with a FINAL list that includes the productcodes (labelled: k_YEAR) of (1) 2022 (the most recent data point), and (2) three out of the four years prior to 2022 (e.g. 2020, 2019 and 2017; 2019, 2018 and 2017). Sadly I haven't found a way how to code this in PROC SQL nor in SAS code. In other words, SAS should keep the k_YEAR values for which it finds a match in 3 out of 4 cases and drop it if it doesn't. My tables are named like this, e.g. 2019: FINAL_2019_EU27.
Can someone help me out? If you need more information, feel free to ask. I am using SAS 9.4.
@Invictus wrote:
Hey Tom thank you for your propositions. The first proposition does not give me unique values and simply gives me a stacked list of all products my 5 different datasets have in common. Quid?
Yes. Your were not explicit about what you wanted. The interleaving method is the easiest as it does require you to know it advance how many datasets you are starting with.
If the goal is just to get the list of K_YEAR values then it does not matter about the other variables then the MERGE with the IN= dataset options seems to be the simplest. You can just the copy/paste features of your editor to create some "wallpaper" code.
I cannot remember your convoluted dataset or variable names but the pattern of the code would be like this:
data want;
merge
y2018 (in=in2018)
y2019 (in=in2019)
y2020 (in=in2020)
y2021 (in=in2021)
y2022 (in=in2022)
;
by k_year ;
if in2022 or sum(of in2018-in2022) > 2 ;
keep k_year description;
run;
Dear Cynthia
Thank you for your quick response. I included a snippet of a table for the year 2019. This is an example of 1 tabel out of 5 I have, one for each year (2018, 2019, 2020, 2021 and 2022). So I have 5 tables like this where the variable name k_YEAR, in this case 2019, is my main indicator. In that column you can find around 150 observations consisting of productcodes. As you already suggested, these tables need to be matched on the observations in k_YEAR. You are correct that I was referring to an observation (in this case a six-digit number) and not to the variable name, sorry for my vagueness.
The creation of a final table with unique product codes (observations under k_YEAR) is my goal. This table should be created under two conditions 1) if the productcode appears in 2022 (my most recent data point) then the productcode should automatically be selected regardless of the other years, and 2) for the remaining years (2018-2021) that productcode should only be selected if it appears in 3 out of the 4 years/tables. That means that a product should not be selected when it only appears in the tables for 2018 and 2019, thus only appearing twice as opposed to thrice. In the example you gave the product should automatically be selected because it appears in the 2022 data, my first criterium. I hope this clears up my question. If you have more, don't hesitate to ask.
Posting a photograph of your data helps a little bit, but posting actual data as TEXT is going to be much easier to work with.
So assuming the 5 datasets are sorted by K_YEAR you can
Stack (or interleave) the observations by the value of K_YEAR. If you don't have an actual variable with the YEAR value then you could use the INDSNAME= SET statement option to make a temporary variable which you could use to derive the YEAR.
Note I am too lazy to type out all of your dataset names. If you had placed the YEAR at the end it would have been easier since you could have used a dataset name list.
data want;
set final_2018_eu27 ..... final_2022_eu27 indsname=dsname ;
by k_year ;
year = input(scan(dsname,-2,'_'),4.);
run;
Now you will need to calculate across observations. PROC SQL is useful for that.
proc sql;
create subset as
select * from want
group by k_year
having max(year=2019)=1 or count(*)>2
;
quit;
And you will now have between 1 and 5 observations for each of the K_YEAR values that meet your criteria.
Or merge them together, in which case you will need to find new variable names for the common variables. Let's assume that is only the IMPORTANCE.... variable. If there are more variables then this does not seem like the right way to handle this problem.
data want;
merge
final_2018_eu27(rename=(importance_2018_eu27=imp2018)
...
final_2022_eu27(rename=(importance_2022_eu27=imp2022)
;
by k_year ;
run;
You can find the subset you want with this test.
data subset;
set want;
if not missing(imp2019) or n(of imp2018-2022)>2;
run;
Note you could have just included the subsetting IF statement in your first data step.
Hey Tom thank you for your propositions. The first proposition does not give me unique values and simply gives me a stacked list of all products my 5 different datasets have in common. Quid?
@Invictus wrote:
Hey Tom thank you for your propositions. The first proposition does not give me unique values and simply gives me a stacked list of all products my 5 different datasets have in common. Quid?
Yes. Your were not explicit about what you wanted. The interleaving method is the easiest as it does require you to know it advance how many datasets you are starting with.
If the goal is just to get the list of K_YEAR values then it does not matter about the other variables then the MERGE with the IN= dataset options seems to be the simplest. You can just the copy/paste features of your editor to create some "wallpaper" code.
I cannot remember your convoluted dataset or variable names but the pattern of the code would be like this:
data want;
merge
y2018 (in=in2018)
y2019 (in=in2019)
y2020 (in=in2020)
y2021 (in=in2021)
y2022 (in=in2022)
;
by k_year ;
if in2022 or sum(of in2018-in2022) > 2 ;
keep k_year description;
run;
Hi:
I don't have a lot of time to make more fake data for visualization purposes, and I still have questions about the little bit of data that you showed. However, extrapolating from the little snippet, here's how I visualize what I think you described:
To make things simple, I shortened the description. It wasn't clear to me whether each file had a separate DESCR_HS92 or not. Also not clear to me whether every "importance" variable had a year in the variable name. However, based on what I think you described, product code 1111 in my fake data IS in the 2022 table, but there are only 2 years (2021 and 2020) that have 1111 as a product code, so that row would not meet the condition. The same with product code 2222 in my fake data. 2222 is in the 2022 table, but there are only 2 years(2020 and 2019) with product code 2222, so that row would not meet the condition. Finally, product code 4444 is in the 2022 table, and there are 3 years with product code 4444 (2021, 2019 and 2018), so I think that of the 3 rows originally found in the fake 2022 data, only the row with product code 4444 meets the condition that I think you outlined. If each file does have the DESCR_HS92 variable, are they always the same for every product code or are they different. Do you want to keep all the IMPORTANCE.... variables for all the matches? Still not clear to me what the final list, report or dataset is supposed to look like.
I think it's not a good idea to bury a year in a variable name if K_2022 and K_2021 are all product codes, they why not just call them PRODUCTCODE or PCODE or something that you could use in a MERGE. PROC SQL would join them without a common name, but you did not show the SQL that you tried that did not give you the result you said you wanted. So at this point, I've fizzled out on ideas.
Cynthia
Okay to make things easier: as output I would only want two columns as an endproduct of my two conditions, i.e. k_year and descr_hs92, the importance variable I am not interested in and should thus not be considered.
The fake visualization you made is perfect and represents how my data is structured for each year I have (2018-2022).
@Cynthia_sas wrote:... product code 1111 in my fake data IS in the 2022 table, but there are only 2 years (2021 and 2020) that have 1111 as a product code, so that row would not meet the condition.
This statement is incorrect, as I pointed out if the productcode (observation of k_year) appears in 2022 then the productcode should automatically be selected regardless of whether it only comes up once or twice in the the other years prior to 2022. The description for a product is always the same yes. I haven't tried any code because I simply do not know how to translate my two conditions to code.
This table should be created under two conditions 1) if the productcode appears in 2022 (my most recent data point) then the productcode should automatically be selected regardless of the other years, and 2) for the remaining years (2018-2021) that productcode should only be selected if it appears in 3 out of the 4 years/tables.
This seems easy enough if i understand correctly.
data IN2018 IN2019 IN2020 IN2021 IN2022;
do K_2019 = 1 to 150;
DESC = cat('product #', K_2019);
if ranuni(1) < .95 then output IN2018;
if ranuni(1) < .95 then output IN2019;
if ranuni(1) < .95 then output IN2020;
if ranuni(1) < .95 then output IN2021;
if ranuni(1) < .95 then output IN2022;
end;
run;
data WANT;
merge IN2018(in=_IN2018)
IN2019(in=_IN2019)
IN2020(in=_IN2020)
IN2021(in=_IN2021)
IN2022(in=_IN2022);
by K_2019;
if _IN2022 %* 1) if the productcode appears in 2022 ;
| sum(of _IN2018-_IN2021)>2; %* 2) if it appears in 3+ out of the 4 years/tables;
run;
The same logic can be written in SQL but will be a lot more verbose.
Something like this pseudo code:
create table WANT as
select coalesce (all K_2019), coalesce (all DESC), coalesce (other variables as needed)
from full join of all tables on K_2019
where IN2022.K_2019 is not missing
or count of other missing K_2019 less than 2;
This solution makes me lose the productcodes (stored under k_year) and the respective description of a product? Do you know why this is? You do seem to have understood my criteria, so I feel we are almost there! I renamed all my k_2019, ..., k_2022 to k_year. Ideally the final list should now also include the respective productcodes and descriptions as opposed to 1,2,3,... codes that we have here now.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.