I want to read this into two variables: make and model name. E.g. make=chevrolet, model_name=chevelle malibu
Not sure how to do this since the lengths vary.
I tried the code below. But I am left with quotes " in the text variable.
Filename carnames url 'http://www';
Data Carnames;
INFILE Carnames MISSOVER;
LENGTH n1 n2 n3 $40;
INPUT @2 n1 $ n2 $ n3 $;
Run;
"chevrolet chevelle malibu"
"buick skylark 320"
"plymouth satellite"
"amc rebel sst"
"ford torino"
"ford galaxie 500"
"chevrolet impala"
"plymouth fury iii"
"pontiac catalina"
"amc ambassador dpl"
"citroen ds-21 pallas"
"chevrolet chevelle concours (sw)"
"ford torino (sw)"
"plymouth satellite (sw)"
"amc rebel sst (sw)"
Easy.
Data Carnames; INFILE cards dsd pad MISSOVER length=len; INPUT x $varying200. len; make=scan(dequote(x),1); model=substr(left(dequote(x)),find(dequote(x),' ')+1); cards; "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" "ford torino" "ford galaxie 500" "chevrolet impala" "plymouth fury iii" "pontiac catalina" "amc ambassador dpl" "citroen ds-21 pallas" "chevrolet chevelle concours (sw)" "ford torino (sw)" "plymouth satellite (sw)" "amc rebel sst (sw)" ; Run;
Ksharp
Slightly simplified version  ;
;
Data Carnames;
INFILE cards ;
INPUT ;
make=scan(dequote(_infile_),1);
model=substr(dequote(_infile_),length(make)+1);
cards;
"chevrolet chevelle malibu"
"buick skylark 320"
"plymouth satellite"
"amc rebel sst"
"ford torino"
"ford galaxie 500"
"chevrolet impala"
"plymouth fury iii"
"pontiac catalina"
"amc ambassador dpl"
"citroen ds-21 pallas"
"chevrolet chevelle concours (sw)"
"ford torino (sw)"
"plymouth satellite (sw)"
"amc rebel sst (sw)"
;
Run;
Haikuo
Devil's advocate here: what about "Alfa Romeo Giulia" ?
I can't understand what you mean . What output do you expect ?
Sorry Ksharp, seems I was too scarce with my comment. What I'm trying to convey:
Still unclear?
Robert
OK.
With regards to the hash table approach this begs the question whether (and I am talking about every day's reuse of the data here) the hash table can be hardened to a storage medium such that keyed access to a single item is possible without reloading the entire table first. Imagine a multimillion rows table with an index on one or more variables. Which one is more effective: accessing the desired row with a keyed access directly from the storage medium or loading the hash table into memory first and then doing a find on it? I guess it boils down to a "it depends".
Note that I have close to zero experience with hashes but from what I've read (code samples in SAS-L mainly) I'm under the impression that they aren't the swiss army knife for every data retrieval problem.
Use the $QUOTE. informat to eliminate the quotes around the value in the input file. Use the TRUNCOVER rather than the MISSOVER option to avoid problems when reading past the end of the physical line. With missover the value becomes missing when you read past the end, but with truncover you get the values that are there instead.
Read the line into a character variable and then parse out the model and make from there. This will also let you code in a list of models that require two words like ALFA ROMEO.
I also added UPCASE() to make it easier to test for specific values.
data carnames;
infile cards truncover ;
length line $100 make model $50 ;
input line $quote100. ;
line = upcase(line);
model = scan(line,1,' ');
if model in ('ALFA') then model=catx(' ',model,scan(line,2,' '));
make = substr(line,length(model)+2);
put ( line model make ) (=/);
cards;
"chevrolet chevelle malibu"
"buick skylark 320"
"plymouth satellite"
"amc rebel sst"
"ford torino"
"ford galaxie 500"
"chevrolet impala"
"plymouth fury iii"
"pontiac catalina"
"amc ambassador dpl"
"citroen ds-21 pallas"
"chevrolet chevelle concours (sw)"
"ford torino (sw)"
"plymouth satellite (sw)"
"amc rebel sst (sw)"
"Alfa Romeo Giulia"
run;
Well, it's no doubt doable. That wasn't the goal of the question. Rather to raise the awareness that the problem as it was presented is difficult to solve without some kind of lookup table containing manufacturer's names. Although I can't name one off the top of my head, mergers may have created names consisting of three or more parts.
I agree. I think your only option is to collapse the data by the first three or four strings, and from that derive a table of "make" values. After extracting that, the remainder becomes the "model". There are just too many odd possibilities to automate the detection.
However,one optimization might be to determine if there are only two words, and if there are assume that they are make and model. The makes extracted from that can then be used against the remainder, and any matches can be handled in an automated fashion. This should leave a very small residue for manual evaluation.
I've dealt with the same problem with municipality and province, and there's really no count-based solution.
Tom
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
