topic Re: Issue merging data in Statistical Procedures

Issue merging data

tgm4052 — Fri, 10 Oct 2014 20:14:31 GMT

For the variables I have checked, they appear to be numeric in both datasets.

13678 data call.&name;

13679 retain RSSD9001 RSSD9999;

13680 merge work.&name call.&name (rename=(RCON9999=RSSD9999));

ERROR: Variable RSSD9050 has been defined as both character and numeric.

ERROR: Variable RCON6979 has been defined as both character and numeric.

ERROR: Variable RCON5562 has been defined as both character and numeric.

ERROR: Variable RCON5563 has been defined as both character and numeric.

ERROR: Variable RCON5564 has been defined as both character and numeric.

ERROR: Variable RCON5565 has been defined as both character and numeric.

ERROR: Variable RCON5566 has been defined as both character and numeric.

ERROR: Variable RCON5567 has been defined as both character and numeric.

ERROR: Variable RCON5568 has been defined as both character and numeric.

ERROR: Variable RCON5569 has been defined as both character and numeric.

The instructions for this data merge are here: http://chicagofed.org/digital_assets/others/banking/financial_institution_reports/CDR_TabDelimitedFileImport.txt

But I can't seem to resolve the issue. Any help would be greatly appreciated

Re: Issue merging data

ballardw — Fri, 10 Oct 2014 20:20:39 GMT

Run Proc contents on both sets or Proc Compare I bet you will find the variables are different types.

A very likely cause is importing data with Proc Import, especially from spreadsheets, where the type is being "guessed". Then some values look like numbers but at least one wasn't and the type is character.

One solution is to import one of the files with proc import and use the code generated to read another file with similar format changing in and output filename/dataset. Or since your link mentions Tab delimited, write code to read the file from scratch.

Re: Issue merging data

tgm4052 — Mon, 13 Oct 2014 14:40:11 GMT

I suppose I'll give that a shot. Otherwise, I might try UPDATE instead of MERGE and see if that doesn't yield the same error messages. There would simpler alternative like proc sql but this dataset has roughly 4000 variables.

Re: Issue merging data

RW9 — Mon, 13 Oct 2014 14:53:57 GMT

Well, you haven't specified a keep on either dataset, so you are getting all the variables from the first table and all the variables from the second. Do you need them all? If so consider applying a rename to all the variables from the second table. Or as you mention if you want a merge of all the data together then update in a datastep, or update via SQL would be a good idea, though you do as ballardw mentions above look at the meta data of each table and identify why the variables have different types. It could be many things, proc import "deciding" on its own what the datatype shoul be, maybe on older data it was all numbers, now with later data some text has appeared etc. Maybe those variables don't exist so have been created, and end up with differing types. Step through your program one step at a time and see what is created and what type is given (right click on dataset and view columns). I also note you have two different namings RCON and RSS, make sure they are assigned correctly.

Re: Issue merging data

Kurt_Bremser — Mon, 13 Oct 2014 14:59:59 GMT

The ERROR messages indicate that variables of the same name are contained in both datasets to be merged. Aside from the variables in the by statement, this is a BAD idea because it will lead to (almost) unpredictable values in the resulting dataset.

Re: Issue merging data

tgm4052 — Mon, 13 Oct 2014 16:33:34 GMT

Unfortunately I do need to retain all variables. Basically they added new variables to the Fed's Call Reports after 2011 and to keep the analysis consistent I need to merge the two, the vast majority of the data should be identical with differences arising with the advent of the new variables.

Re: Issue merging data

Peter_C — Mon, 13 Oct 2014 16:55:28 GMT

tgm4052 wrote:

Unfortunately

they added new variables

YOU .NEED to keep the analysis consistent

majority of the data should be identical with differences arising with the advent of the new variables.

except thar

there are other differences

- in the data type in columns identified in those SAS error messages.

RSSD9050 both character and numeric.

RCON6979 and RCON5562

Re: Issue merging data

tgm4052 — Mon, 13 Oct 2014 17:42:26 GMT

That's right. It is suggested that the two data files are identical except for the new variables. But obviously there is a difference here between what ought to be and what is. Unfortunately, I don't use SAS much and this is turning especially problematic. Is there a way to use proc sql without identifying every one of the 3000-4000 variables.

Re: Issue merging data

Kurt_Bremser — Tue, 14 Oct 2014 05:29:27 GMT

But you won't keep the values of the data this way. Merging datasets with identical variables leads to bogus data, since which value "trumps" during the merge is quite unpredictable.

You need to rename the variables so that they can exist in parallel, and you can then decide in the data step which values to keep.

Since SAS already complains about different data types, you need to do some serious cleaning before you actually get to the merge.

If the variables in common _should_ contain identical values (I would NEVER trust statements like that without thoroughly checking the data first. I consider all outside data bullshit until proven otherwise; just think of the fact that in your case you have numeric and character mixed up, already), then you do best by dropping them from one of the datasets during the merge, so that you avoid conflict. You can write a preceding step to determine the common variables and set up a drop list in a macro variable.

Re: Issue merging data

Peter_C — Tue, 14 Oct 2014 05:55:15 GMT

Kurt

i thought it was documented that non by-variables are used from the most-right table that contributes that variable ( now I must find that doc ).

of course even if we can be that specific it wont fix the problem in that ERROR:

Re: Issue merging data

Kurt_Bremser — Tue, 14 Oct 2014 06:20:14 GMT

Imagine the following situation:

dataset have1:

Key Data

A 1

A 2

A 3

dataset have2:

Key Data

A 4

A 5

data want;

merge have1 have2;

by Key;

run;

Dataset want will have the values of have2 in the records 1 & 2 of Key 'A', but record 3 will have the values of have1.

I repeat: having non-by variables with identical names in a merge is usually a BAD idea. Luckily, tgm4052's example crashed because of differing data types, resulting in a good learning example about cleaning input data.

So what needs to be done?

- Make sure that variables with identical names have identical data types

- Check if identical variables in identical records (as identified by key or position) have correct identical values.

- Make sure that no ill effects arise like what I mentioned at the beginning of this post

- Write a solid merge that deals with everything

Re: Issue merging data

Peter_C — Tue, 14 Oct 2014 07:20:21 GMT

Although agreeing with you, I would make a further recommendation:

When treating data like transactions for a database table probably we should use the db----- options like DBTYPE

and save PROC IMPORT for places where the input structure is unknown.

Of course it would help if data suppliers provided a definition of the data supplied.

Re: Issue merging data

jakarman — Tue, 14 Oct 2014 07:42:02 GMT

Reviewing the instructions for getting the data they are using proc import.

It could be better to replace that one with an approach that does not guessing column types.

Bu guessing you have most likely the cause the datasets got different types. That is the failure at merging.

Using the datastep merge the first (left) value gets precedence unless missing. That is different when using sql where you have to specify that.