BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
acordes
Rhodochrosite | Level 12
I'm writing from my smartphone. Therefore I'm not providing a data set to illustrate my challenge.
I have an internal 3 alphanumeric codification for optional equipment.
This is just a technical code, you cannot derive information from it, if the code "PH1" stands for roof top or whatever.
I have a second table from a provider with the literal description of the equipment associated to the provider's equipment code.

Both tables are on chassis number level.
So imagine we have a car identified by the chassis number xyz with 3 optional equipment.
Table A
Ph1
Gf4
Abb

And table B
01342 roof top
04533 leather seats
07823 multimedia package

Now my challenge is to match our code with their code.

My initial idea would be to start with low equipped cars (ideally only 1 equipment code) y solve sequentially the matching links.
I would resort to IML to do so.

What Do you think? Are their procs that could get me started?

I have the feeling that a matrix approach with SVD could yield good results as well. It's quite similar to text analytics, isn't it?

I'm just thinking loud.
So I would span a matrix with the
Provider's code as rows and 0/1 in the columns with our internal code.
1 ACCEPTED SOLUTION

Accepted Solutions
Rick_SAS
SAS Super FREQ

This is an example of a transposition cipher. Assuming you have sufficient information about cars in the first data set, you might be able to match the codes with some or all of the information in the second data set.

 

A problem that you will encounter is that there is no guarantee of a unique solution. Suppose anti-lock brakes and electronic door locks always appear together and no car has one without the other. Then there is no way to determine that XYZ is anti-lock brakes whereas PQR is electronic door locks. On the other hand, it doesn't matter for most analyses how you code these features if they always appear together. 

 

For IML programs about ciphers, see https://blogs.sas.com/content/tag/ciphers/ . All of the programs use an English letter corpus to provide a way to score whether a given transposition is accurate. Some also use frequency analysis of letters and bigrams. I think those issues are important: If you know that some features (anti-lock brakes) appear more often than others (sun roof), you can use the relative frequencies to guide the analysis.  If you know that certain features are bundled together, that is akin to bigrams and trigrams in traditional text analysis.

 

You propose an iterative approach. First, decipher all those codes for vehicles that have one feature. Then try to use those solutions to attack the vehicles that have two features. However, at this step, you might encounter vehicles that have two features that are not yet known. At that point, your method breaks down. This is when you might consider a frequency analysis.

 

I think this is a difficult problem to try to code by hand and solve in an automated fashion. Especially, if the number of coded values ("letters") is large.

 

View solution in original post

8 REPLIES 8
Patrick
Opal | Level 21

I don't see anything in your table A that would allow to match a record to table B. Is there any other information available in the data that would allow to establish such a relationship (like sort order or what else)?

acordes
Rhodochrosite | Level 12
No. That's all. I don't have any additional information. The order of table A and B is arbitrary.
Some chassis numbers (meaning one specific car) were built with 5 optional equipment meanwhile others could have 15.
So it's a combinatorial problem.

I was just searching the community, and I think my problem can be solved with proc assoc and that it's known under the term market basket analysis.
Patrick
Opal | Level 21

OK. You might understand what you're talking about as I don't. Even if it's a combinatorial problem then you would still need information that links the data between the two tables as else it's just going to be "random" or a cartesian product whatever smart algorithm you can come-up with.

acordes
Rhodochrosite | Level 12
I have hundreds of thousands of cars with their specific equipment.
The challenge is to find a rule that code 01234 always is present with PH1.
Patrick
Opal | Level 21

And there is NOTHING in the data you shared that would allow you to make such a connection.

 

For example if you would have the following two tables...

Table Weather
Sunshine
Rain

Table Location
Mt. Waialeale on Kauai in Hawaii
Phoenix, Arizona

.... then without any further information nothing will allow you to tell that Phoenix is one of the sunniest places in the US and Mt. Waialeale one of the rainiest (according to Google).

 

That's what you're dealing with here if you haven't got more information than what you shared with us.

ballardw
Super User

@acordes wrote:
I have hundreds of thousands of cars with their specific equipment.
The challenge is to find a rule that code 01234 always is present with PH1.

Are they all the same maker? IF not good luck as ABC from one make might well be PDQ from another.

Also within the same make the same code may mean different things based on the model and the model year.

 

Plus cars that shared components may have the same item with different code as the "high end" option in one car could be the low end in another line.

 

For a rough example, PONTIAC with WS6 is the code for the current model year "performance" or "sports" suspension package and depending on year might mean manual or automatic transmission, specific wheels, modified cooling system and/or exhaust, possibly hood configuration and who knows what exterior badging might or might not appear for any specific year. It appears that as later buyers were no longer able to specify single options that some makers codes became "trim level" indicators and not quite so much the individual items.

 

 

Tom
Super User Tom
Super User

You need some other information to even begin to attempt to make any attempt to match the codes. 

For example if you knew that particular option was selected for 50% of the cars then you could analyze the data on the actual cars and see what they call something that appears on half the cars and perhaps imply those two option names are the same.

Rick_SAS
SAS Super FREQ

This is an example of a transposition cipher. Assuming you have sufficient information about cars in the first data set, you might be able to match the codes with some or all of the information in the second data set.

 

A problem that you will encounter is that there is no guarantee of a unique solution. Suppose anti-lock brakes and electronic door locks always appear together and no car has one without the other. Then there is no way to determine that XYZ is anti-lock brakes whereas PQR is electronic door locks. On the other hand, it doesn't matter for most analyses how you code these features if they always appear together. 

 

For IML programs about ciphers, see https://blogs.sas.com/content/tag/ciphers/ . All of the programs use an English letter corpus to provide a way to score whether a given transposition is accurate. Some also use frequency analysis of letters and bigrams. I think those issues are important: If you know that some features (anti-lock brakes) appear more often than others (sun roof), you can use the relative frequencies to guide the analysis.  If you know that certain features are bundled together, that is akin to bigrams and trigrams in traditional text analysis.

 

You propose an iterative approach. First, decipher all those codes for vehicles that have one feature. Then try to use those solutions to attack the vehicles that have two features. However, at this step, you might encounter vehicles that have two features that are not yet known. At that point, your method breaks down. This is when you might consider a frequency analysis.

 

I think this is a difficult problem to try to code by hand and solve in an automated fashion. Especially, if the number of coded values ("letters") is large.

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 757 views
  • 1 like
  • 5 in conversation