I have these 2 datasets:

1) A list of IDs with addresses from various countries

2) A list of countries with related terms, e.g. China (country) and Beijing (a term related to China).


How do I assign a country to each ID in the first dataset?

This depends on the size of the lookup table (linking places to countries). If it is large, joining the tables on places/towns/zip codes is recommended. If it is small enough to fit into available memory, creating a format from it or using a hash might be a better solution.
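
As a rough illustration of the join approach, here is a minimal sketch in Python using the standard-library `sqlite3` module. It assumes the addresses have already been split so that each ID has a town value; the table names, column names, and the function `assign_countries_by_join` are all hypothetical, not part of your data.

```python
import sqlite3


def assign_countries_by_join(addresses, lookup):
    """Join (id, town) rows to (term, country) rows on the town name.

    Both table layouts are assumptions made for this sketch.
    IDs with no matching term come back with country None.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE addresses (id INTEGER, town TEXT)")
    conn.execute("CREATE TABLE lookup (term TEXT, country TEXT)")
    conn.executemany("INSERT INTO addresses VALUES (?, ?)", addresses)
    conn.executemany("INSERT INTO lookup VALUES (?, ?)", lookup)
    # LEFT JOIN keeps every ID, matched or not.
    # SQLite's lower() only folds ASCII letters by default.
    rows = conn.execute(
        """
        SELECT a.id, l.country
        FROM addresses AS a
        LEFT JOIN lookup AS l ON lower(a.town) = lower(l.term)
        ORDER BY a.id
        """
    ).fetchall()
    conn.close()
    return rows
```

Calling it with `[(1, "Beijing"), (2, "Lyon")]` and a lookup containing `("Beijing", "China")` would pair ID 1 with China and leave ID 2 unmatched. In a real database you would likely want an index on the lookup term column.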

Addresses tend to be free-form text, and parsing them inside the join criteria could be expensive. Another alternative would be to load the lookup table as a hash and do the lookup in memory.
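
To make the in-memory idea concrete, here is a minimal sketch in Python, where a `dict` plays the role of the hash. The function names `build_term_index` and `assign_country`, the `max_words` parameter, and the sample data are all assumptions for illustration; real addresses will need more careful tokenising and a rule for ambiguous terms.

```python
import re


def build_term_index(lookup):
    """Build a dict from lower-cased term to country.

    `lookup` is assumed to be an iterable of (term, country) pairs,
    with the country name itself included as one of its own terms.
    """
    return {term.lower(): country for term, country in lookup}


def assign_country(address, index, max_words=3):
    """Return the country of the first known term found in the address.

    Longer phrases are tried first so that a multi-word term such as
    "new york" is preferred over any shorter term inside it.
    Returns None when nothing matches.
    """
    # Keep runs of letters only; digits and punctuation are dropped.
    words = re.findall(r"[^\W\d_]+", address.lower())
    for n in range(max_words, 0, -1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase in index:
                return index[phrase]
    return None


index = build_term_index([
    ("China", "China"),
    ("Beijing", "China"),
    ("New York", "United States"),
])
countries = {
    101: assign_country("12 Chaoyang Rd, Beijing 100020", index),
    102: assign_country("350 5th Ave, New York, NY", index),
    103: assign_country("Somewhere unknown", index),
}
```

Each address costs a handful of dictionary probes regardless of how many terms the lookup table holds, which is the main appeal over parsing the text inside a join.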
