Re: Best way to proc merge?

Astounding · Posted 03-16-2018 08:33 PM

It's a little too much to do a detailed analysis here, but ...

Only one data set is subset to contain CEOs only: CEO

Other data sets are being merged in, but are not subset in this way. The mismatches can introduce non-CEO observations (or at least can introduce missing values).

sastuck · Posted 03-17-2018 10:23 AM

Is there a simple way to make the "keep only CEOs" code overwrite the dataset I already have so that I don't have this issue? I don't need the non-CEO data, so I would like to just get rid of it.

Thanks!

Astounding · Posted 03-17-2018 11:36 AM

That's not the issue here. You need to learn about this program:

data combined;

merge ceo (in=a) other_data (in=b);

by id;

***** something here;

run;

You need to learn how to use the variables A and B to select any of these combinations:

matches only
mismatches where the ID appears in CEO but not in OTHER_DATA
mismatches where the ID appears in OTHER_DATA but not in CEO
IDs that appear in CEO, whether or not they also appear in OTHER_DATA
IDs that appear in OTHER_DATA, whether or not they also appear in CEO

sastuck · Posted 03-17-2018 12:36 PM

Thanks.

in the code, are A and B symbolizing the different datasets?

Cynthia_sas · Posted 03-17-2018 02:01 PM

Hi:

Here's a concrete example to illustrate the use of the IN= option on the MERGE statement. The IN= option creates a temporary variable that holds either a 1 or a 0 value depending on whether the dataset contributes any information to the current Program Data Vector (PDV) being populated by the MERGE (the output data comes from the PDV). Your DATA step will create a PDV area for every dataset being created.

The nice thing about using IN= is that you can do multiple tests and multiple outputs based on your conditions. For example, Consider this data. WORK.ONE is the "name" file. It has COMPANY, JOB, YEAR and NAME. WORK.TWO is the "amount" file. It has COMPANY, JOB and AMOUNT:

data one;
  infile datalines;
  input company $ job $ year name $;
return;
datalines;
aa job1 2012 Alan
aa job4 2012 Alexa
bb job1 2012 Bob
cc job1 2012 Carla
dd job1 2012 Dana
dd job9 2012 Dave
ee job1 2012 Edith
;
run;

data two;
  infile datalines;
  input company $ job $ amount ;
return;
datalines;
aa job1 101
aa job2 102
aa job3 103
bb job1 201
cc job2 302
cc job3 303
dd job1 401
dd job9 409
ee job1 501
ff job1 601
;
run;

There are matches and non-matches and multiple jobs in both files. Merging on COMPANY and JOB gives me a unique combo of variables for the MERGE. However, if I do a simple MERGE, with only a BY, I will probably get undesirable results because the default is for all the matches and nonmatches to be put into the output dataset:

Given the data, I have some issues. I do have matches, but they are matches for a variety of jobs. What if I only want to get matches for the rows with JOB1? Well, that's pretty easy to do, but what about row #6 for Carla, she was in my WORK.ONE dataset, but there wasn't a match for her row in WORK.TWO - where do I want her row, in with matches or in with non-matches? Or do I need to go back and fix the data?

Let's look at the first use of a MERGE with IN=, where I save the values of the temporary variables into the output dataset, so I can examine them:

And, if I didn't want any further filter on JOB or YEAR, then this might be OK. But what if I DO want to separate the observations into one set of files where JOB=JOB1 and another set of files where JOB NE JOB1??? Then I have to modify the logic in the program and output to more datasets:


data match_both match_job1 match_not_job1
     name_only name_job1 name_not_job1
     amount_file_only amount_for_job1 amount_not_job1;
  merge one(in=in_name) two(in=in_amt);
  by company job;
  if in_name=1 and in_amt=1 then do;
     output match_both;
	 if job='job1' then output match_job1;
	 else if job ne 'job1' then output match_not_job1;
  end;
  else if in_name=1 and in_amt=0 then do;
     output name_only;
	 if job='job1' then output name_job1;
	 else if job ne 'job1' then output name_not_job1;
  end;
  if in_name=0 and in_amt=1 then do;
     output amount_file_only;
     if job = 'job1' then output amount_for_job1;
	 else if job ne 'job1' then output amount_not_job1;
  end; 
run;

proc print data=match_both;
  title '1) Match on Company and Job';
  title2 'Without filter on JOB';
run;

proc print data=name_only;
  title '2) Have name, but no amount';
  title2 'Without filter on JOB';
run;

proc print data=amount_file_only;
  title '3) Have amount, but not in name file';
  title2 'Without filter on JOB';
run;
 
proc print data=match_job1;
  title '4) Match on Company and Job';
  title2 'Only filter JOB1';
run;

proc print data=name_job1;  
  title '5) Have name, but no amount';
  title2 'Only filter JOB1';
run;

proc print data=amount_for_job1;  
  title '6) Have amount, but not in name file';
  title2 'Only filter JOB1';
run;

proc print data=match_not_job1;
  title '7) Match on Company and Job';
  title2 'NOT JOB1 Matches';
run;

proc print data=name_not_job1;
  title '8) Have name, but no amount';
  title2 'Name is NOT JOB1 have no match in amount file';
run;

proc print data=amount_not_job1;
  title '9) Have amount, but not in name file';
  title2 'Amount is NOT JOB1, have no match in name file';
run;

And the output has been filtered even further so that all the JOB1 rows -- matches and non-matches are in separate files so you can deal with them. Here's the output from all of the above PROC PRINTs, slightly rearranged:

There might be other ways you want to create the output datasets. You might want fewer datasets, you might want to treat the matches and nonmatches differently for the JOB1 rows. At this point, it is just a matter of changing the program logic.

Hope this helps,

cynthia

sastuck · Posted 03-18-2018 12:25 PM

Hi @Astounding, I am having trouble understanding/implementing your suggestions. Is there no simpler way to use CEO as opposed to the other datasets in the merge?

Thanks

Astounding · Posted 03-18-2018 12:32 PM

If you know more than a little SQL, that is sometimes an alternative. But note ... by definition, MERGE will combine 2 or more SAS data sets. When you combine them, you may find some mismatches ... a combination of BY variables that exists in one data set but not others. It is a "must learn" skill to be able to determine when you have a match and when you have a mismatch. It is a "must learn" skill to be able to select in your output data set the types of matches or mismatches that you would like. I'm not even saying that this will necessarily solve the problem you are encountering. I'm just saying that it is a possibility that it will, and learning how to determine matches and mismatches is something you will need week in and week out.