BookmarkSubscribeRSS Feed
sandeep249
Calcite | Level 5

I want to read this into two variables: make and model name. E.g. make=chevrolet, model_name=chevelle malibu

Not sure how to do this since the lengths vary.

I tried the code below. But I am left with quotes " in the text variable.

Filename carnames url 'http://www';

Data Carnames;

INFILE Carnames MISSOVER;

LENGTH n1 n2 n3 $40;

INPUT @2 n1 $ n2 $ n3 $;

Run;

"chevrolet chevelle malibu"

"buick skylark 320"

"plymouth satellite"

"amc rebel sst"

"ford torino"

"ford galaxie 500"

"chevrolet impala"

"plymouth fury iii"

"pontiac catalina"

"amc ambassador dpl"

"citroen ds-21 pallas"

"chevrolet chevelle concours (sw)"

"ford torino (sw)"

"plymouth satellite (sw)"

"amc rebel sst (sw)"

10 REPLIES 10
Ksharp
Super User

Easy.

Data Carnames;
INFILE cards dsd pad MISSOVER length=len;
INPUT x $varying200. len;
make=scan(dequote(x),1);
model=substr(left(dequote(x)),find(dequote(x),' ')+1);
cards;
"chevrolet chevelle malibu"
"buick skylark 320"
"plymouth satellite"
"amc rebel sst"
"ford torino"
"ford galaxie 500"
"chevrolet impala"
"plymouth fury iii"
"pontiac catalina"
"amc ambassador dpl"
"citroen ds-21 pallas"
"chevrolet chevelle concours (sw)"
"ford torino (sw)"
"plymouth satellite (sw)"
"amc rebel sst (sw)"
;
Run;

Ksharp

Haikuo
Onyx | Level 15

Slightly simplified version Smiley Happy;

Data Carnames;

INFILE cards ;

INPUT ;

make=scan(dequote(_infile_),1);

model=substr(dequote(_infile_),length(make)+1);

cards;

"chevrolet chevelle malibu"

"buick skylark 320"

"plymouth satellite"

"amc rebel sst"

"ford torino"

"ford galaxie 500"

"chevrolet impala"

"plymouth fury iii"

"pontiac catalina"

"amc ambassador dpl"

"citroen ds-21 pallas"

"chevrolet chevelle concours (sw)"

"ford torino (sw)"

"plymouth satellite (sw)"

"amc rebel sst (sw)"

;

Run;

Haikuo

Robert_Bardos
Fluorite | Level 6

Devil's advocate here: what about "Alfa Romeo Giulia" ?

Ksharp
Super User

I can't understand what you mean . What output do you expect ?

Robert_Bardos
Fluorite | Level 6

Sorry Ksharp, seems I was too scarce with my comment. What I'm trying to convey:

  • ^'model' and 'make' are difficult to parse out when the only delimiter between them are blanks
    • examples: 'alfa romeo giulia', 'sunbeam talbot alpine', 'steyr puch haflinger' where 'model' consists of the
      first two words
    • examples with more than two words designating 'model' are imaginable though I can't name one
      off the top of my head
  • values requiring special treatment should not be hard coded in programs but stored in tables much rather
    • i.e. sequences of "if lowcase(scan(_infile_,1)) = 'alfa' then ....'  statements in the program source
      tend to become a maintenance nightmare

Still unclear?

Robert

Ksharp
Super User

OK.

  • It is up to OP. Since OP only show us taking the first word as make, we will not be able to image so far situation.
  • But If I were you ,I will use a Hash table to maintain these key words and get what you want.
Robert_Bardos
Fluorite | Level 6

With regards to the hash table approach this begs the question whether (and I am talking about every day's reuse of the data here) the hash table can be hardened to a storage medium such that keyed access to a single item is possible without reloading the entire table first. Imagine a multimillion rows table with an index on one or more variables. Which one is more effective: accessing the desired row with a keyed access directly from the storage medium or loading the hash table into memory first and then doing a find on it? I guess it boils down to a "it depends".

Note that I have close to zero experience with hashes but from what I've read (code samples in SAS-L mainly) I'm under the impression that they aren't the swiss army knife for every data retrieval problem.

Tom
Super User Tom
Super User

Use the $QUOTE. informat to eliminate the quotes around the value in the input file.  Use the TRUNCOVER rather than the MISSOVER option to avoid problems when reading past the end of the physical line. With missover the value becomes missing when you read past the end, but with truncover you get the values that are there instead.

Read the line into a character variable and then parse out the model and make from there.  This will also let you code in a list of models that require two words like ALFA ROMEO.

I also added UPCASE() to make it easier to test for specific values.

data carnames;

  infile cards truncover ;

  length line $100 make model $50 ;

  input line $quote100. ;

  line = upcase(line);

  model = scan(line,1,' ');

  if model in ('ALFA') then model=catx(' ',model,scan(line,2,' '));

  make = substr(line,length(model)+2);

  put ( line model make ) (=/);

cards;

"chevrolet chevelle malibu"

"buick skylark 320"

"plymouth satellite"

"amc rebel sst"

"ford torino"

"ford galaxie 500"

"chevrolet impala"

"plymouth fury iii"

"pontiac catalina"

"amc ambassador dpl"

"citroen ds-21 pallas"

"chevrolet chevelle concours (sw)"

"ford torino (sw)"

"plymouth satellite (sw)"

"amc rebel sst (sw)"

"Alfa Romeo Giulia"

run;

Robert_Bardos
Fluorite | Level 6

Well, it's no doubt doable. That wasn't the goal of the question. Rather to raise the awareness that the problem as it was presented is difficult to solve without some kind of lookup table containing manufacturer's names. Although I can't name one off the top of my head, mergers may have created names consisting of three or more parts.

TomKari
Onyx | Level 15

I agree. I think your only option is to collapse the data by the first three or four strings, and from that derive a table of "make" values. After extracting that, the remainder becomes the "model". There are just too many odd possibilities to automate the detection.

However,one optimization might be to determine if there are only two words, and if there are assume that they are make and model. The makes extracted from that can then be used against the remainder, and any matches can be handled in an automated fashion. This should leave a very small residue for manual evaluation.

I've dealt with the same problem with municipality and province, and there's really no count-based solution.

Tom

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1460 views
  • 0 likes
  • 6 in conversation