DATA Step, Macro, Functions and more

how to read this data

Reply
Occasional Contributor
Posts: 6

how to read this data

 

inputs are  name make  year seat price

 

dedion lamarquise 1884 4 4.6
rolls-royce silver ghost 1912 4 1.7
mercedes-benz  ssk 1929 2 7.4
f-88   1954 . 3.2
ferrari 250testa rossa 1957 2 16.3

 

 

i uesd the code

 

data oldcars;
infile datalines dlm= " " dsd missover;
input  names: $13. make& :$13. year :4. seat 1. price 1.1;
datalines;
dedion lamarquise 1884 4 4.6
rolls-royce silver ghost 1912 4 1.7
mercedes-benz  ssk 1929 2 7.4
f-88   1954 . 3.2
ferrari 250testa rossa 1957 2 16.3
;
run;

 

but only the 1st variable is read correctly..

 

can someone help me with this

BE THE ODD IN MILLIONS
Super Contributor
Posts: 251

Re: how to read this data

[ Edited ]

There are two small problems with your code. One is: when you are using delimiters, you don't want to be specific about the variable length in the input statement, because then that overrides what SAS is doing working out where the delimiters are.

 

The second one is with the Silver Ghost: it has an imbedded space. So I converted the delimiter to a comma throughout, including a double comma for the F-88. I've converted the missing value on that same row to another double comma - the dsd keyword allows it to read missing values without even a warning.

 

So what I've done is to define the variables at the top with the maximum lengths, and formats for how to display them. Note that there are no informats on the input statement.

 

This will work better:

data oldcars;
infile datalines dlm=',' dsd missover;
attrib names make length=$ 13;
attrib year length=4 format=4.;
attrib seat length=3;
attrib price length=8 format=5.1;
input names
      make
      year
      seat
      price;
datalines;
dedion,lamarquise,1884,4,4.6
rolls-royce,silver ghost,1912,4,1.7
mercedes-benz,ssk,1929,2,7.4
f-88,,1954,,3.2
ferrari,250testa rossa,1957,2,16.3
;
run;

 

Occasional Contributor
Posts: 6

Re: how to read this data

Thanks..
BE THE ODD IN MILLIONS
Super User
Posts: 10,500

Re: how to read this data

Please examine this code:

data oldcars;
   infile datalines dlm= " " dsd missover;
   informat  names $13. make  $14. year f4. seat f1. price f1.1;
   input  names  make & year  seat  price ;
datalines;
dedion lamarquise  1884 4 4.6
rolls-royce silver ghost  1912 4 1.7
mercedes-benz ssk  1929 2 7.4
f-88 .  1954 . 3.2
ferrari 250testa rossa  1957 2 16.3
;
run;

When you use space as a delimiter things get ugly quickly. You may have to introduce double spaces to differentiate between the imbedded blanks and separators or . to placehold for missing values.

 

Use of  formats on the input statement starts forcing some reads that may ignore your delimiter.

 

BTW I would tend to think of Rolls-Royce, Ferrari and Mercedes-Benz as the make and Silver Ghost, Testa Rossa and SSK as the model.

 

It is also possible that the code and data I played with wasn't exctly as you have. When you do not post code and data into a code box, opened with the {i} menu, then the forum "translates" your post into html and we have seen aritifacts appear from that translation that cause errors (other than the ones you see when you run the code).

 

Note that I increased the length of Make because you didn't allow sufficient characters to hold "250testa rossa" which is 14 characters.

PROC Star
Posts: 7,363

Re: how to read this data

Assuming you inadvertantly left out "Volvo" as the make of the f-88, the following code might do what you need:

 

data oldcars (drop=_Smiley Happy;
  informat names $20.;
  informat make $20.;
  input @;
  CALL SCAN(_infile_, 2, _position, _length," ");
  _infile_=cat(substr(_infile_,1,_position-1)," ",substr(_infile_,position));
  CALL SCAN(_infile_, -3, _position, _length," ");
_infile_=cat(substr(_infile_,1,_position-1)," ",substr(_infile_,_position));
  input make & model & year seat price;
  datalines;
dedion lamarquise 1884 4 4.6
rolls-royce silver ghost 1912 4 1.7
mercedes-benz ssk 1929 2 7.4
volvo f-88 1954 . 3.2
ferrari 250testa rossa 1957 2 16.3
;
run;

 

HTH,

Art, CEO, AnalystFinder.com

 

Super Contributor
Posts: 251

Re: how to read this data

@art297, I think you're making it too complicated. Using delimited reading with dsd on the infile statement, SAS can do all the heavy lifting. It certainly makes the code more maintainable and extensible.

PROC Star
Posts: 7,363

Re: how to read this data

LaurieF: I'm reposting my code as there were some errors in my original code.

 

However, I disagree! SAS can't always do all of the heavy lifting unless it is given some necessary clues. The ONLY clues my code addsare some necessary double spacing so that SAS can do all of the rest.

 

If there are easier ways to deal with the example data I, for one, wouild definitely like to see it.

 

data oldcars (drop=_Smiley Happy;
  informat make $20.;
  informat model $20.;
  input @;
  CALL SCAN(_infile_, 2, _position, _length," ");
  _infile_=cat(substr(_infile_,1,_position-1)," ",substr(_infile_,_position));
  CALL SCAN(_infile_, -3, _position, _length," ");
  _infile_=cat(substr(_infile_,1,_position-1)," ",substr(_infile_,_position));
  input make & model & year seat price;
  datalines;
dedion lamarquise 1884 4 4.6
rolls-royce silver ghost 1912 4 1.7
mercedes-benz ssk 1929 2 7.4
volvo f-88 1954 . 3.2
ferrari 250testa rossa 1957 2 16.3
;
run;

 

Art, CEO, AnalystFinder.com

Super Contributor
Posts: 251

Re: how to read this data

The hints for dlm are in the attrib/informat statements in my and @ballardw's solutions above - and there were other problems in the original data which were always going to have to be sorted out. I thought it prudent to follow the example of the problem-setter.

PROC Star
Posts: 7,363

Re: how to read this data

LaurieF: Then I'd hope we can at least agree to disagree.

 

I have absolutely nothing against the previously suggested solutions other than that they require the input data to be manually manipulated. My problem with such approaches is that each time one manually makes changes to their data, they risk introducing even more errors than those which might already be present.

 

The OP, in this case, presented a space delimited file. My suggested code was to use a couple of SAS statements to allow the file to be read directly as a space delimited file.

 

If the OP has to deal with thousands, or millions of such records, manual manipulation would simply be out of the question.

 

However, if my approach works for the vast majority of those records, it would be a lot easier for them to use an approach like I suggested, output the successes, and out and manually correct just those records which fail.

 

Art, CEO, AnalystFinder.com

Respected Advisor
Posts: 4,646

Re: how to read this data

Here is an approach using pattern matching that assumes that the name is a single word and the make is anything between the name and the year :

 

data want;
if not prx then 
    prx + prxParse("/\s*(\S*)\s+(.*)([12]\d{3})\s+(\S+)\s+(\S+)/");
input;
if prxMatch(prx, _infile_) then do;
    name = prxPosn(prx,1,_infile_); 
    make = prxPosn(prx,2,_infile_); 
    year = input(prxPosn(prx,3,_infile_),?? best.); 
    seat = input(prxPosn(prx,4,_infile_), ?? best.); 
    price = input(prxPosn(prx,5,_infile_),?? best.);
    end; 
drop prx;
datalines;
dedion lamarquise 1884 4 4.6
rolls-royce silver ghost 1912 4 1.7
mercedes-benz  ssk 1929 2 7.4
volvo f-88   1954 . 3.2
ferrari 250testa rossa 1957 2 16.3
;

proc print data=want noobs; run;
PG
PROC Star
Posts: 7,363

Re: how to read this data

PG: Long time no see! Glad to see that you're still responding here and hope you had the best of holidays.

 

I fully agree that regular expressions allow one to search for patterns that can't easily be accomplished with other functions (if at all), but the problem I have is that the algorithm used in SAS is far too inefficient.

 

Based on your approach, though, I re-wrote my code to follow the same approach using a combination of call scan and the scan function. It was slightly faster than my originally proposed code. However, the regular expression solution took more than ten times longer than the other two approches.

 

Here is the test I ran (on SAS University Edition):

 

data _null_;
file "/folders/myfolders/test.txt";
input;
do _n_=1 to 100000;
put _infile_;
end;
datalines;
dedion lamarquise 1884 4 4.6
rolls-royce silver ghost 1912 4 1.7
mercedes-benz ssk 1929 2 7.4
volvo f-88 1954 . 3.2
ferrari 250testa rossa 1957 2 16.3
mazda mazda 5 2009 4 15.8
;


data want;
if not prx then
prx + prxParse("/\s*(\S*)\s+(.*)([12]\d{3})\s+(\S+)\s+(\S+)/");
infile "/folders/myfolders/test.txt";
input;
if prxMatch(prx, _infile_) then do;
name = prxPosn(prx,1,_infile_);
make = prxPosn(prx,2,_infile_);
year = input(prxPosn(prx,3,_infile_),?? best.);
seat = input(prxPosn(prx,4,_infile_), ?? best.);
price = input(prxPosn(prx,5,_infile_),?? best.);
end;
drop prx;
run;

 

data oldcars (drop=_Smiley Happy;
informat make $20.;
informat model $20.;
infile "/folders/myfolders/test.txt";
input @;
CALL SCAN(_infile_, 2, _position, _length," ");
_infile_=cat(substr(_infile_,1,_position-1)," ",substr(_infile_,_position));
CALL SCAN(_infile_, -3, _position, _length," ");
_infile_=cat(substr(_infile_,1,_position-1)," ",substr(_infile_,_position));
input make & model & year seat price;
run;

 

data want2 (drop=_Smiley Happy;
format make $20.;
format model $20.;
infile "/folders/myfolders/test.txt";
input;
price=input(scan(_infile_,-1,' '),12.);
seat=input(scan(_infile_,-2,' '),12.);
year=input(scan(_infile_,-3,' '),12.);
make=scan(_infile_,1,' ');
CALL SCAN(_infile_, 1, _start, _slength," ");
CALL SCAN(_infile_, -3, _end, _elength," ");
model=substr(_infile_,_slength+2,_end-_slength-3);
run;

 

 CPU time required for regular expression approach: 4.51 seconds

 CPU time required for call scan approach: 0.43 seconds

 CPU time required for call scan and scan function approach: 0.38 seconds

Occasional Contributor
Posts: 6

Re: how to read this data

Thank you . But it looks complicated
BE THE ODD IN MILLIONS
Super Contributor
Posts: 251

Re: how to read this data

I agree. It's jumping through hoops, and it's not extensible. Stick with dsd and a different delimiter from space, if you can.

PROC Star
Posts: 7,363

Re: how to read this data

Whether you need the methods like the ones that @PGStats and I suggested depend on whether you have to work with a file you were given (that looks like the one in your example) or are going to be entering the data yourself.

 

If it's the former, and you don't want to use either regular expressions or scan to identify the necessary fields, then manually inserting delimiters (e.g., commas) between each field is your only other choice. I, myself, avoid doing that whenever possible as it introduces too much risk of making errors by inadvertantly deleting data or misplacing the delimiters. 

 

However, it it is the latter, then creating a comma delimited file is definitely a better way to go.

 

Art, CEO, AnalystFinder.com

 

Ask a Question
Discussion stats
  • 13 replies
  • 296 views
  • 4 likes
  • 5 in conversation