This is an example of a transposition cipher. Assuming you have sufficient information about cars in the first data set, you might be able to match the codes with some or all of the information in the second data set.
A problem that you will encounter is that there is no guarantee of a unique solution. Suppose anti-lock brakes and electronic door locks always appear together and no car has one without the other. Then there is no way to determine that XYZ is anti-lock brakes whereas PQR is electronic door locks. On the other hand, it doesn't matter for most analyses how you code these features if they always appear together.
For IML programs about ciphers, see https://blogs.sas.com/content/tag/ciphers/ . All of the programs use an English letter corpus to provide a way to score whether a given transposition is accurate. Some also use frequency analysis of letters and bigrams. I think those issues are important: If you know that some features (anti-lock brakes) appear more often than others (sun roof), you can use the relative frequencies to guide the analysis. If you know that certain features are bundled together, that is akin to bigrams and trigrams in traditional text analysis.
You propose an iterative approach. First, decipher all those codes for vehicles that have one feature. Then try to use those solutions to attack the vehicles that have two features. However, at this step, you might encounter vehicles that have two features that are not yet known. At that point, your method breaks down. This is when you might consider a frequency analysis.
I think this is a difficult problem to try to code by hand and solve in an automated fashion. Especially, if the number of coded values ("letters") is large.
I don't see anything in your table A that would allow to match a record to table B. Is there any other information available in the data that would allow to establish such a relationship (like sort order or what else)?
OK. You might understand what you're talking about as I don't. Even if it's a combinatorial problem then you would still need information that links the data between the two tables as else it's just going to be "random" or a cartesian product whatever smart algorithm you can come-up with.
And there is NOTHING in the data you shared that would allow you to make such a connection.
For example if you would have the following two tables...
Table Weather Sunshine Rain Table Location Mt. Waialeale on Kauai in Hawaii Phoenix, Arizona
.... then without any further information nothing will allow you to tell that Phoenix is one of the sunniest places in the US and Mt. Waialeale one of the rainiest (according to Google).
That's what you're dealing with here if you haven't got more information than what you shared with us.
@acordes wrote:
I have hundreds of thousands of cars with their specific equipment.
The challenge is to find a rule that code 01234 always is present with PH1.
Are they all the same maker? IF not good luck as ABC from one make might well be PDQ from another.
Also within the same make the same code may mean different things based on the model and the model year.
Plus cars that shared components may have the same item with different code as the "high end" option in one car could be the low end in another line.
For a rough example, PONTIAC with WS6 is the code for the current model year "performance" or "sports" suspension package and depending on year might mean manual or automatic transmission, specific wheels, modified cooling system and/or exhaust, possibly hood configuration and who knows what exterior badging might or might not appear for any specific year. It appears that as later buyers were no longer able to specify single options that some makers codes became "trim level" indicators and not quite so much the individual items.
You need some other information to even begin to attempt to make any attempt to match the codes.
For example if you knew that particular option was selected for 50% of the cars then you could analyze the data on the actual cars and see what they call something that appears on half the cars and perhaps imply those two option names are the same.
This is an example of a transposition cipher. Assuming you have sufficient information about cars in the first data set, you might be able to match the codes with some or all of the information in the second data set.
A problem that you will encounter is that there is no guarantee of a unique solution. Suppose anti-lock brakes and electronic door locks always appear together and no car has one without the other. Then there is no way to determine that XYZ is anti-lock brakes whereas PQR is electronic door locks. On the other hand, it doesn't matter for most analyses how you code these features if they always appear together.
For IML programs about ciphers, see https://blogs.sas.com/content/tag/ciphers/ . All of the programs use an English letter corpus to provide a way to score whether a given transposition is accurate. Some also use frequency analysis of letters and bigrams. I think those issues are important: If you know that some features (anti-lock brakes) appear more often than others (sun roof), you can use the relative frequencies to guide the analysis. If you know that certain features are bundled together, that is akin to bigrams and trigrams in traditional text analysis.
You propose an iterative approach. First, decipher all those codes for vehicles that have one feature. Then try to use those solutions to attack the vehicles that have two features. However, at this step, you might encounter vehicles that have two features that are not yet known. At that point, your method breaks down. This is when you might consider a frequency analysis.
I think this is a difficult problem to try to code by hand and solve in an automated fashion. Especially, if the number of coded values ("letters") is large.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.