09-13-2016 04:28 AM
I have two datasets, in dataset one, I have some states with state id(stateid) but vacant state name(statename). In dataset two, I have only two variables: state id(stataid), state name(statename), and all stateid and statename are not vacant, I sort both dataset by stateid and merge the two, my code is as follows:
merge one two;
but I still have the same vacant state names!
What's wrong with my code?
Thanks a lot!
09-13-2016 04:43 AM
When you have same variable (state name) on both data sets, only one takes place in output.
You can either drop the vacant sate name or use update instead merge, where the second data set
updates variables from the first one.
09-13-2016 05:06 AM - edited 09-13-2016 05:13 AM
Let's dig a little into the workings of the data step.
data have1; infile cards truncover; input state_id state_name $; cards; 1 1 2 2 3 ; run; data have2; infile cards truncover; input state_id state_name $; cards; 1 NY 2 CA 2 CA 3 FL ; run; data want; merge have1 have2; by state_id; run;
When the third data step is compiled, SAS looks at the structure of the contributing datasets, and creates a PDV (program data vector) with the variables state_id and state_name.
When the data step executes its first iteration, data is read from have1, resulting in an empty state_name; next, data from have2 is read (as state_id matches), and state_name is set to "NY". Data is output.
Then, in the second iteration, a matching record from have1 is read, once again state_name is set to empty; since no further matching record is present in have2, no data is read from there, and therefore the empty state_name is written to the ouput.
With state_id=2 I deliberately inserted a second record in have2, and since a second matching read from have2 can be performed, the value "CA" is set for both empty records in have1.
A proper merge would look like that:
data want; merge have1 have2 (rename=(state_name=new_state_name)) ; by state_id; if state_name = ' ' then state_name = new_state_name; drop new_state_name; run;
Edit: took care of improper code formatting and typos
09-13-2016 06:44 AM
The values for STATE_NAME get read only once, from each source of data. So on the first observation per StateID, the STATE_NAME always comes from TWO, as the values from TWO overwrite the values from ONE. But after the first observation per StateID, the values from ONE are read in and no longer get overwritten by values from TWO. So the simple solution would be:
merge one (drop=state_name) two;