Solved: How to refer to columns in a particular dataset in a DATA STEP merge?

tomcmacdonald · Posted 04-04-2018 02:09 PM

These are my test datasets:

data a;
	input id variety $;
	datalines;
1 jazz
2 gala
3 empire
4 rome
5 mcintosh
;
run;

data b;
	input id place $;
	datalines;
1 binghamton
2 westchester
3 scranton
4 utica
5 albany
;
run;

Sorting boilerplate:

proc sort data=a; by id; run;
proc sort data=b; by id; run;

I'm trying to do something like this but with the DATA STEP:

proc sql;
	create table foo as 
	select *
	from a join b on a.id = b.id 
	where prxmatch('/^a/i', b.place);
quit;

This is my attempt but it throws syntax errors:

data foo;
	merge a(in=a) b(in=a);
	by id;
	where prxmatch('/^a/i', b.place);
run;

How would I refer to b.place in the data step?

Tom · Posted 04-04-2018 02:59 PM

@tomcmacdonald wrote:

This is the error message I get when executing your first code snippet:
ERROR: Variable place is not on file WORK.A.

The purpose of the WHERE statement is to limit the data that is read into the data step. If you use a WHERE statement in a data step it is applied to all of the input dataset.

If you want to limit the observations that are written out you could just convert to a subsetting IF statement instead or convert the logic to use DELETE or OUTPUT statements to determine which observations are written out.

data want;
  merge A B ;
  by ID ;
  if (....) ;
run;

Or you could use the WHERE= dataset option on the dataset it applies to to limit the records read from that dataset.

data want;
  merge A B(where=(....)) ;
  by ID ;
run;

View solution in original post

ballardw · Posted 04-04-2018 02:27 PM

In this case since PLACE only exists in one data set there is no need to attempt to specify the set, it is already in the data vector as the variable Place. Just use: where prxmatch('/^a/i', place);

Assuming that you actually meant (otherwise both data sets contributed the same in variable)

data foo;
	merge a(in=a) b(in=b);
	by id;
	where prxmatch('/^a/i', b.place);
run;

MERGE combines the records before the executable steps following the merge are executed. So the concept of a variable belonging to as specific data set is gone.

You can reference records that contributed to the current record with

If b then <do something>

However to see a real difference between data step merge and sql join use this data b:

data b;
	input id place $ variety $;
	datalines;
1 binghamton  a
2 westchester b
3 scranton    c
4 utica       d
5 albany      e
;
run;

and see what happens in data foo.

Then do the same data foo but reverse the order of a and b on the merge statement.

tomcmacdonald · Posted 04-04-2018 02:31 PM

This is the error message I get when executing your first code snippet:

ERROR: Variable place is not on file WORK.A.

Tom · Posted 04-04-2018 02:59 PM

@tomcmacdonald wrote:

This is the error message I get when executing your first code snippet:
ERROR: Variable place is not on file WORK.A.

The purpose of the WHERE statement is to limit the data that is read into the data step. If you use a WHERE statement in a data step it is applied to all of the input dataset.

If you want to limit the observations that are written out you could just convert to a subsetting IF statement instead or convert the logic to use DELETE or OUTPUT statements to determine which observations are written out.

data want;
  merge A B ;
  by ID ;
  if (....) ;
run;

Or you could use the WHERE= dataset option on the dataset it applies to to limit the records read from that dataset.

data want;
  merge A B(where=(....)) ;
  by ID ;
run;

ballardw · Posted 04-04-2018 06:56 PM

Please show code executed. I did not intend you to run

data foo;
	merge a(in=a) b(in=b);
	by id;
	where prxmatch('/^a/i', b.place);
run;

because the b.place is an error, that was to point out you had used (in=a) for both set a and b.

Executable code would be

data foo;
	merge a b;
	by id;
	where prxmatch('/^a/i',place);
run;

Since the In variables really weren't of use. And assumes the prxmatch is valid.

How to refer to columns in a particular dataset in a DATA STEP merge?

Re: How to refer to columns in a particular dataset in a DATA STEP merge?

Re: How to refer to columns in a particular dataset in a DATA STEP merge?

Re: How to refer to columns in a particular dataset in a DATA STEP merge?

Re: How to refer to columns in a particular dataset in a DATA STEP merge?

Re: How to refer to columns in a particular dataset in a DATA STEP merge?

How to refer to columns in a particular dataset in a DATA STEP merge?

Re: How to refer to columns in a particular dataset in a DATA STEP merge?

Re: How to refer to columns in a particular dataset in a DATA STEP merge?

Re: How to refer to columns in a particular dataset in a DATA STEP merge?

Re: How to refer to columns in a particular dataset in a DATA STEP merge?

Re: How to refer to columns in a particular dataset in a DATA STEP merge?

Click image to register for webinar

Classroom Training Available!