Re: trying to match two datasets

elopomorph88 · Posted 05-28-2020 02:41 PM

OK. I am using Proc Compare but I am just not getting the answer that I need. Here is my situation:

I have two datasets: Dataset A and dataset B that are different. However, they have a column of data that is the same. There is a site variable that is the same between the two datasets. However, they have different lists of sites so not all of the sites listed in dataset A are listed in dataset B, and vice versa. They also have different variable between the two datasets, but they do share a site variable. We can call the shared site variable "Site X". Site X is listed as a format of $10 in both datasets. Site X is an odd text file composed of both letters and text. An example of Site X is FL0952MB. I want to match site X in both dataset A and B, but in a certain way.

I want dataset A to stay the same but I want to know if any of the specific Site X sites from dataset B shows up in dataset A. I know it will for some of the records but I have datasets of over 1,000 records and I don't want to go through them by hand. So I want to generated a column of data in dataset A that is something like a 0 if that specific Site X is NOT in dataset B and a 1 if that specific site IS listed in dataset B. I am hoping to get an output like this for Dataset A where the "matched" column means that that specific site X was listed in dataset B:

Site X matched

FL0952MB 1

RL0989MG 1

IU0882RF 0

GR1752PO 0

YZ5632UY 1

Any help would be greatly appreciated.

ballardw · Posted 05-28-2020 03:01 PM

It really helps to provide examples from data from both sets.

Here is one way that includes creating two small data sets that the code can test on.

data dataseta;
   input site $;
datalines;
abc
pdq
x34
;
data datasetb;
   input site $;
datalines;
abc
x34
ggg
;

proc sql;
   create table InA as
   select a.site, ( a.site=b.site) as matched
   from (select distinct site from dataseta ) as a
        left join
        (select distinct site from datasetb) as b
        on a.site=b.site
   ;
quit;

Obviously I don't have your data. If your "site" variable name is different in the two sets replace the name for all the places the variable is used. The A. and B. (the DOT is important) is an SQL alias that allows you to not repeatedly type a longer data set name, tells the procedure which contributing table has the variable, and is pretty much required if your datasets are in permanent libraries as lib.datset.variable will generate syntax errors.

The subquery with the "select distinct" creates sets with only one of each value for comparison. The Left Join says "all the values from the A (or left referenced set) will be included in the output. The ON sets the condition to match things, in this case the equality of the sites. The Select with the (a.site=b.site) as matched uses the property that SAS will use 1 for true/0 for false in results of logic. Where the values don't match, the B set doesn't have the site, the result is False.

elopomorph88 · Posted 05-28-2020 04:03 PM

Thanks to both of you!

PGStats · Posted 05-28-2020 03:39 PM

Quite straitforward with proc SQL:

proc sql;
create table datasetC as
select
	*,
	siteX in (select siteX from datasetB) as matched
from datasetA;
quit;

PG

novinosrin · Posted 05-28-2020 04:49 PM


data datasetC  ;
 set dataseta;
 if _n_=1 then do;
 if 0 then set  datasetb;
   dcl hash H (dataset:'datasetb') ;
   h.definekey  ("site") ;
   h.definedone () ;
 end;
 matched=h.check()=0;
run;

trying to match two datasets