Two Datasets without a Unique_ID

Reply
Contributor
Posts: 49

Two Datasets without a Unique_ID

I’ve two big datasets and both don’t have a unique_id to join/merge, Does any one have an idea how to join them? Data_1 is a summary file and Data_2 is a detail of the summary of the file. Here is an example of both datasets. Any help is appreciated?

Data_1 is a summary file

IDGroup
00233ABCD
00233ABCD
02013Oxy
02013Oxy
02013Oxy

Data_2 is a detail of the summary of the file

IDGroup
0233ABCD
0233ABCD
0233ABCD
0233ABCD
0233ABCD
0233ABCD
0233ABCD
PSM2013Oxy Inc
PSM2013Oxy Inc
PSM2013Oxy Inc
PSM2013Oxy Inc
PSM2013Oxy Inc
PSM2013Oxy Inc
PROC Star
Posts: 1,231

Re: Two Datasets without a Unique_ID

Hi,

I think it would help people help you if you decribe what you would want for the output dataset, and what logic rules would get you there.

--Q.

Super Contributor
Posts: 276

Re: Two Datasets without a Unique_ID

Hi,

Can you please explain how the output data should look like ??

Thanks,

Sanjeev.K

Super Contributor
Posts: 644

Re: Two Datasets without a Unique_ID

Given the sample you have provided, and the similarities of the IDs you likely have another problem, that of a many to many join which will throw up a Cartesian product.  Ie, if you successfully join the 3 oxy records in 1 with the 5 records in 2 you will have 15 records.  You probably need to do some analysis on your data to narrow down the multiples in at least one table.

Having said that it looks (from the very small sample) that the numerical component of the IDs could be used to match the two tables.  I would suggest a numeric conversion on each table:

     Id_Num = input (compress (Id, , 'KD'), best.) ;

But if you go down that track you had better to a cross tab of Id against Id_Num to ensure that the values of Id_Num are unique.  If they are not you are going to have to dig deeper to find how these IDs relate to each other.

Richard in NZ

Contributor
Posts: 49

Re: Two Datasets without a Unique_ID

Thank you all for the great respond and input. Actually this question is no longer needed. I appreciated your time.

RichardinOZ - Can you explain what do you mean 'KD' best in your example?

Id_Num = input (compress (Id, , 'KD'), best.) ;

Thanks,

Super Contributor
Posts: 644

Re: Two Datasets without a Unique_ID

Ali

The aim was to convert character ids with a mix of numeric and alpha characters into numeric ids by stripping out unwanted characters.  Type conversion to numeric would also deal with varying leading zeros.

To convert character calues to numeric the input() function is used, in this case with the informat best.  You may be more familiar with the reverse process converting numeric values to character using the put() function and the format also called best.

In this case we need first to get rid of alpha characters such as PSM and that is the role of the compress() function.  Normally the second parameter of the function specifies which characters to remove, so I could have written

     compress (Id, 'ABCDEFG ... Zabcdefg ... z!#$%^&*')

to be sure of eliminating all non numeric characters.  This is impractical so the third parameter can have the value 'K' (= keep) which reverses the effect of the second parameter so you can specify instead a string of characters you wish to keep.  So the function could have been written

     compress (Id, '0123456789', 'K')

But wait, there's more - the third parameter can also include codes that are short for common expressions.  In this case, 'D' means decimals (ie the string '0123456789') so we don't even have to type in this list.  In fact, the second parameter now can be blank. So

     compress (Id, , 'KD')   [note the two consecutive commas]

means "Just keep any decimal (ie numeric) characters in the string and discard everything else".

Richard in NZ (but going home for Christmas)

Contributor
Posts: 49

Re: Two Datasets without a Unique_ID

Thanks, Richard. that was helpful. Happy Holidays

Frequent Contributor
Posts: 114

Re: Two Datasets without a Unique_ID

Hi,

It would be great for us if you elaborate your conditions for the output data/results.

Regards

Uma Shanker Saini

Ask a Question
Discussion stats
  • 7 replies
  • 348 views
  • 3 likes
  • 5 in conversation