Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

Data Merge problem

Reply
Occasional Contributor
Posts: 8

Data Merge problem

Hello All,

I am trying to merge two data sets Table1 has 3000 observation and Table2 has 6000 observation. I am trying to merge two data sets with three different variable those are common in both data sets. So after executing the code below i get the Merge1 table with 6000 observation as expected. however the Merge1 table is missing the observation for the variable i selected from the Table1 which is a.ACCNR, A.AN, a. RLTVM ect.  I checked the format of all three variables that i am trying to join in both data set and the format are the same. Also when i replaced right join with left join in the code below, i am missing the observation from the Table2.

Does anyone have an idea why the resulting table is missing the observations?

Proc sql;

create table Merge1 as

select    a.ACCNR, a.AN, a.RLTVM_BA,a.CYCLE_PK, a.site, a.conc, a.LNCONC, b.ACCNBR_IF, b.RLTVM_IF, b.ACTDATE, b.PKTM, b.CYCLE_IF, b.AN_IF

from Table1 a

right join Table2 b ON  a.AN=b.AN_if and A.RLTVM_BA=B.RLTVM_IF and a.cycle_PK=b.cycle_IF

order by AN;

quit;

Super Contributor
Posts: 337

Re: Data Merge problem

I am a data step person. While you wait for a SQL kind of guy, this is what I would do (specially because you have few observations and sorting is an option.

I hope I get it right that the key is all: an, rtlvm_ba, and cycle_pk.

E.g.

proc sort data=table1; by an rltvm_ba cycle_pk; run;

proc sort data=table2; by an_if rltvm_if cycle_if; run;

data merge1;

merge table1 table2(rename=(an_if=an rltvm_if=rtlvm_ba cycle_if=cycle_pk));

by an rltvm_ba cycle_pk;

run;

Occasional Contributor
Posts: 8

Re: Data Merge problem

Posted in reply to M_Maldonado

Hi Migue,

Invalid rename values since i already have the same name variable in data=table1.

Thanks

PROC Star
Posts: 7,485

Re: Data Merge problem

Your description is misleading, but I think you may want to do a full join and expect more than 6,000 observations to be output.

And, with a datastep merge, you can always include keep options along with your recode option.

Occasional Contributor
Posts: 8

Re: Data Merge problem

Hi Arthur

I tried the full join and i got around 9000 observation which is not right either. The resulted merge1 table contain first 6000 records from table2 and then 3000 record from table1. which is not right because i now there are about 90%common obseration between table1 and tabl2.

Thanks

Super Contributor
Posts: 337

Re: Data Merge problem

I thought that an_if rltvm_if cycle_if in table 2 were equivalent to an rltvm_ba cycle_pk in table 1.

Then rename in both tables I guess?

please catch my back on typos if I made any!

proc sort data=table1; by an rltvm_ba cycle_pk; run;

proc sort data=table2; by an_if rltvm_if cycle_if; run;

data merge1;

merge table1(rename=(an=an_global rltvm_ba=rtlvm_global cycle_pk=cycle_global)) table2(rename=(an_if=an_global rltvm_if=rtlvm_global cycle_if=cycle_global));

by an_global rltvm_global cycle_global;

run;

Super User
Posts: 19,833

Re: Data Merge problem

You merged 3000 with 6000 and you have missing for some of the a.an and other variables I'm assuming?

That means you didn't necessarily have a match between the observations, i.e. you have some combination in one table that's not in the other.

Your join has 3 variables, which you select from the larger table, except for AN variable. So you won't have any of the other 2 missing, but possibly the AN.

Is that what you're seeing?

Are you certain you have all of the matching records in each set?

Try the following to see if it helps to clarify the issues you're running into. Run a proc freq on the status variable to see what the issue might be.

Proc sql;

create table Merge1 as

select    a.ACCNR,

      a.AN,

     b.an_if,

     a.RLTVM_BA,

     B.RLTVM_IF,

     a.CYCLE_PK,

     b.CYCLE_IF ,

       case when missing(a.an) then 'Table1 Missing'

        else 'Match'

      end as status,

     a.site,

     a.conc,

     a.LNCONC,

     b.ACCNBR_IF,

     b.RLTVM_IF,

     b.ACTDATE,

     b.PKTM

   

from Table1 a

right join Table2 b

ON  a.AN=b.AN_if

and A.RLTVM_BA=B.RLTVM_IF

and a.cycle_PK=b.cycle_IF

order by AN;

quit;

Occasional Contributor
Posts: 8

Re: Data Merge problem

Thanks Reeza,

Yes i am certain that two datasets has common observation. I think the problem is a.cycle_PK and b.cycle_IF has  numeric character values. The format for cycle variable in both table were Character 20. I change the format to the numeric best12. Now i am missing those observation for which values entered in characters.

does anyone know how to resolve the issue?

Thanks

Super User
Posts: 5,513

Re: Data Merge problem

Reeza is right.  There are no matches.  Try printing a few unformatted values from one data set, and hard-code the search for them when reading the other data set.  You won't find any matches.

If you are dealing with character variables, watch for leading blanks that might cause a mismatch.

Good luck.

Respected Advisor
Posts: 4,173

Re: Data Merge problem

As others already stated your results strongly indicate that your 2 tables are not matching over the keys used. The variables must be of the same type (character or numeric) for matching, e.g. a.cycle_PK and b.cycle_IF must both be either numeric or both character. The format is not relevant. If the variables are character then leading blanks or different casing are treated as different strings (= no match).


For character variables try adding something like "upcase(strip(a.cycle_PK)) = upcase(strip(b.cycle_IF))".

For numeric variables with floating point values and the source tables coming from some data base what can happen is a very small "imprecision" when transferring the values from one environment to the other. So here try and use a round() function and see if this makes a difference, eg. "round(a.cycle_PK,0.0000000001) = round(b.cycle_IF,0.0000000001)".

Trusted Advisor
Posts: 3,214

Re: Data Merge problem

When you are merging on numeric ones be sure they are not influenced by calculations so you are caught by numeric precision differences.

Rounding in advance would solve that loophole. Changing a format is just fooling yourself, a format does not change the value just how the value is shown to you.

For a right join you will never get more ovservations as the of the right dataset, dropping the ones that cannot be merged.

---->-- ja karman --<-----
Ask a Question
Discussion stats
  • 10 replies
  • 656 views
  • 0 likes
  • 7 in conversation