Solved: Re: How do I merge different tables with a condition?

Invictus · Posted 07-10-2024 07:43 AM

Hey everyone

I'd like to automate a process I usually do in excel.

Here is my question: after a lot of basic calculations (I use PROC SQL to write my code) I come up with 5 different tables (one for each year: 2018-2022) that list around 150 different products and some economic stats about them. Now I would need to come up with a FINAL list that includes the productcodes (labelled: k_YEAR) of (1) 2022 (the most recent data point), and (2) three out of the four years prior to 2022 (e.g. 2020, 2019 and 2017; 2019, 2018 and 2017). Sadly I haven't found a way how to code this in PROC SQL nor in SAS code. In other words, SAS should keep the k_YEAR values for which it finds a match in 3 out of 4 cases and drop it if it doesn't. My tables are named like this, e.g. 2019: FINAL_2019_EU27.

Can someone help me out? If you need more information, feel free to ask. I am using SAS 9.4.

Tom · Posted 07-11-2024 11:50 AM

@Invictus wrote:

Hey Tom thank you for your propositions. The first proposition does not give me unique values and simply gives me a stacked list of all products my 5 different datasets have in common. Quid?

Yes. Your were not explicit about what you wanted. The interleaving method is the easiest as it does require you to know it advance how many datasets you are starting with.

If the goal is just to get the list of K_YEAR values then it does not matter about the other variables then the MERGE with the IN= dataset options seems to be the simplest. You can just the copy/paste features of your editor to create some "wallpaper" code.

I cannot remember your convoluted dataset or variable names but the pattern of the code would be like this:

data want;
   merge
     y2018 (in=in2018) 
     y2019 (in=in2019) 
     y2020 (in=in2020) 
     y2021 (in=in2021) 
     y2022 (in=in2022)
   ;
   by k_year ;
   if in2022 or sum(of in2018-in2022) > 2 ;
   keep k_year description;
run;

View solution in original post

Cynthia_sas · Posted 07-10-2024 09:23 AM

Hi: Without data, it is impossible to figure out what you want. So you said your productcodes were "labelled" k_Year...that's nice, but what is the variable name? You said that 2022 was the most recent data point, but 2022 has to be the value of a variable. What is that variable named? You said "keep the k_Year values for which it finds a match in 3 out of 4 cases" -- match on what? productcodes? Year? some other variable? You also said "drop "it" if "it" doesn't <match>" -- It makes a difference here, what the meaning of "it" is. I believe you're referring to a variable or an observation, however, not clear what needs to drop. Your logic to keep 3 out of the 4 years prior to 2022 -- again, isn't clear 3 out of the 4 years for each product, perhaps, But you have 5 tables. What if a product is in the 2018, 2021 and 2022 table, but is NOT in the 2019 or 2020 table? What happens then? Could this scenario occur? Your description of the data doesn't make sense. In order to help you, someone would need to mock up some sample data on nearly no information. Can you provide sample data? Fake data? You said you want a final "list" is this a final report or a final dataset?
Unfortunately, more information is needed and possibly some sample data and an example of what the output from your sample data should look like.
Cynthia

Invictus · Posted 07-10-2024 09:48 AM

Dear Cynthia

Thank you for your quick response. I included a snippet of a table for the year 2019. This is an example of 1 tabel out of 5 I have, one for each year (2018, 2019, 2020, 2021 and 2022). So I have 5 tables like this where the variable name k_YEAR, in this case 2019, is my main indicator. In that column you can find around 150 observations consisting of productcodes. As you already suggested, these tables need to be matched on the observations in k_YEAR. You are correct that I was referring to an observation (in this case a six-digit number) and not to the variable name, sorry for my vagueness.

The creation of a final table with unique product codes (observations under k_YEAR) is my goal. This table should be created under two conditions 1) if the productcode appears in 2022 (my most recent data point) then the productcode should automatically be selected regardless of the other years, and 2) for the remaining years (2018-2021) that productcode should only be selected if it appears in 3 out of the 4 years/tables. That means that a product should not be selected when it only appears in the tables for 2018 and 2019, thus only appearing twice as opposed to thrice. In the example you gave the product should automatically be selected because it appears in the 2022 data, my first criterium. I hope this clears up my question. If you have more, don't hesitate to ask.

Tom · Posted 07-10-2024 10:17 AM

Posting a photograph of your data helps a little bit, but posting actual data as TEXT is going to be much easier to work with.

So assuming the 5 datasets are sorted by K_YEAR you can

Stack (or interleave) the observations by the value of K_YEAR. If you don't have an actual variable with the YEAR value then you could use the INDSNAME= SET statement option to make a temporary variable which you could use to derive the YEAR.

Note I am too lazy to type out all of your dataset names. If you had placed the YEAR at the end it would have been easier since you could have used a dataset name list.

data want;
   set final_2018_eu27 ..... final_2022_eu27 indsname=dsname ;
   by k_year ;
   year = input(scan(dsname,-2,'_'),4.);
run;

Now you will need to calculate across observations. PROC SQL is useful for that.

proc sql;
create subset as 
  select * from want
  group by k_year
  having max(year=2019)=1 or count(*)>2
;
quit;

And you will now have between 1 and 5 observations for each of the K_YEAR values that meet your criteria.

Or merge them together, in which case you will need to find new variable names for the common variables. Let's assume that is only the IMPORTANCE.... variable. If there are more variables then this does not seem like the right way to handle this problem.

data want;
   merge
     final_2018_eu27(rename=(importance_2018_eu27=imp2018)
...
     final_2022_eu27(rename=(importance_2022_eu27=imp2022)
   ;
   by k_year ;
run;

You can find the subset you want with this test.

data subset;
  set want;
  if not missing(imp2019) or n(of imp2018-2022)>2;
run;

Note you could have just included the subsetting IF statement in your first data step.

Invictus · Posted 07-11-2024 09:54 AM

Hey Tom thank you for your propositions. The first proposition does not give me unique values and simply gives me a stacked list of all products my 5 different datasets have in common. Quid?

Tom · Posted 07-11-2024 11:50 AM

@Invictus wrote:

Hey Tom thank you for your propositions. The first proposition does not give me unique values and simply gives me a stacked list of all products my 5 different datasets have in common. Quid?

Yes. Your were not explicit about what you wanted. The interleaving method is the easiest as it does require you to know it advance how many datasets you are starting with.

If the goal is just to get the list of K_YEAR values then it does not matter about the other variables then the MERGE with the IN= dataset options seems to be the simplest. You can just the copy/paste features of your editor to create some "wallpaper" code.

I cannot remember your convoluted dataset or variable names but the pattern of the code would be like this:

data want;
   merge
     y2018 (in=in2018) 
     y2019 (in=in2019) 
     y2020 (in=in2020) 
     y2021 (in=in2021) 
     y2022 (in=in2022)
   ;
   by k_year ;
   if in2022 or sum(of in2018-in2022) > 2 ;
   keep k_year description;
run;

Cynthia_sas · Posted 07-10-2024 02:03 PM

Hi:

I don't have a lot of time to make more fake data for visualization purposes, and I still have questions about the little bit of data that you showed. However, extrapolating from the little snippet, here's how I visualize what I think you described:

To make things simple, I shortened the description. It wasn't clear to me whether each file had a separate DESCR_HS92 or not. Also not clear to me whether every "importance" variable had a year in the variable name. However, based on what I think you described, product code 1111 in my fake data IS in the 2022 table, but there are only 2 years (2021 and 2020) that have 1111 as a product code, so that row would not meet the condition. The same with product code 2222 in my fake data. 2222 is in the 2022 table, but there are only 2 years(2020 and 2019) with product code 2222, so that row would not meet the condition. Finally, product code 4444 is in the 2022 table, and there are 3 years with product code 4444 (2021, 2019 and 2018), so I think that of the 3 rows originally found in the fake 2022 data, only the row with product code 4444 meets the condition that I think you outlined. If each file does have the DESCR_HS92 variable, are they always the same for every product code or are they different. Do you want to keep all the IMPORTANCE.... variables for all the matches? Still not clear to me what the final list, report or dataset is supposed to look like.

I think it's not a good idea to bury a year in a variable name if K_2022 and K_2021 are all product codes, they why not just call them PRODUCTCODE or PCODE or something that you could use in a MERGE. PROC SQL would join them without a common name, but you did not show the SQL that you tried that did not give you the result you said you wanted. So at this point, I've fizzled out on ideas.

Cynthia

Invictus · Posted 07-11-2024 09:42 AM

Okay to make things easier: as output I would only want two columns as an endproduct of my two conditions, i.e. k_year and descr_hs92, the importance variable I am not interested in and should thus not be considered.

The fake visualization you made is perfect and represents how my data is structured for each year I have (2018-2022).

@Cynthia_sas wrote:

... product code 1111 in my fake data IS in the 2022 table, but there are only 2 years (2021 and 2020) that have 1111 as a product code, so that row would not meet the condition.

This statement is incorrect, as I pointed out if the productcode (observation of k_year) appears in 2022 then the productcode should automatically be selected regardless of whether it only comes up once or twice in the the other years prior to 2022. The description for a product is always the same yes. I haven't tried any code because I simply do not know how to translate my two conditions to code.

ChrisNZ · Posted 07-11-2024 07:20 AM

This table should be created under two conditions 1) if the productcode appears in 2022 (my most recent data point) then the productcode should automatically be selected regardless of the other years, and 2) for the remaining years (2018-2021) that productcode should only be selected if it appears in 3 out of the 4 years/tables.

This seems easy enough if i understand correctly.

data IN2018 IN2019 IN2020 IN2021 IN2022;
  do K_2019 = 1 to 150;
    DESC = cat('product #', K_2019);
    if ranuni(1) < .95 then output IN2018;
    if ranuni(1) < .95 then output IN2019;
    if ranuni(1) < .95 then output IN2020;
    if ranuni(1) < .95 then output IN2021;
    if ranuni(1) < .95 then output IN2022;
  end;
run;  

data WANT;
  merge IN2018(in=_IN2018)
        IN2019(in=_IN2019)
        IN2020(in=_IN2020)
        IN2021(in=_IN2021)
        IN2022(in=_IN2022);
  by K_2019;
  if _IN2022                     %* 1) if the productcode appears in 2022 ;
   | sum(of _IN2018-_IN2021)>2;  %* 2) if it appears in 3+ out of the 4 years/tables;
run;

The same logic can be written in SQL but will be a lot more verbose.
Something like this pseudo code:

  create table WANT as 
  select coalesce (all K_2019), coalesce (all DESC), coalesce (other variables as needed)
  from full join of all tables on K_2019
  where IN2022.K_2019 is not missing
     or count of other missing K_2019 less than 2;

High-Performance SAS Coding - Third Edition

Invictus · Posted 07-11-2024 09:50 AM

This solution makes me lose the productcodes (stored under k_year) and the respective description of a product? Do you know why this is? You do seem to have understood my criteria, so I feel we are almost there! I renamed all my k_2019, ..., k_2022 to k_year. Ideally the final list should now also include the respective productcodes and descriptions as opposed to 1,2,3,... codes that we have here now.

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away