Solved: Re: Reading Embedded Spaces in Data

Divya · Posted 05-10-2012 12:30 PM

Hi,

Looking for the most efficient method to read this data:

John Matthew 40 $90,000

George Wilson 28 $50,000

Robert Nicolas 30 $60,000

Leo Thomas 35 $70,000

This unaligned data represents name, age and salary variables. The data values are separated by a single space and the variable "name" contains an embedded blank.

Thanks,

Divya

data_null__ · Posted 05-10-2012 06:48 PM

If you know there are always 2 non blank fields that follow you can count backward to find the length of the field and use $VARYING. Using Chang's data.

data one;

infile cards;

input @;

call scan(_infile_,-2,p,l,' ');

l=p-2;

input name $varying30. l age inc :comma9.;

cards;

John Q. Matthew 40 $90,000

George Wilson 28 $50,000

Robert Nicolas 30 $60,000

Leo Thomas 35 $70,000

Hello Kitty . .

;

run;

proc print;

run;

View solution in original post

shivas · Posted 05-10-2012 12:40 PM

Hi,

Try this..

data one;

infile cards dlm='';

input name $ 1-15 age 16-18 sal :dollar.;

format sal dollar7.;

cards;

John Matthew 40 $90,000

George Wilson 28 $50,000

Robert Nicolas 30 $60,000

Leo Thomas 35 $70,000

;

run;

Thanks,

Shiva

Divya · Posted 05-10-2012 12:49 PM

Hi shiva,

Thanks for the response. But the problem I am facing here is name and age data values are separated by a single space and its a long text file so I cannot manually insert spaces between the two and make age values start from the 16th column.

If that was the case, I would have used ampersand format modifier to read it, but here I cannot.

Is there any other way of doing it?

Divya

Linlin · Posted 05-10-2012 12:55 PM

Hi Divya,

I don't have time now. It may helpful to look at this discussion:

https://communities.sas.com/message/112959#112959

Haikuo · Posted 05-10-2012 01:00 PM

This work for your data as is, you can modify it to fit more complex situations hopefully:

data one (drop=_:);

infile cards dlm='';

input @;

name=catx(' ',scan(_infile_,1), scan(_infile_,2));;

input _1 _2 age sal:dollar11.;

format sal dollar7.;

cards;

John Matthew 40 $90,000

George Wilson 28 $50,000

Robert Nicolas 30 $60,000

Leo Thomas 35 $70,000

;

proc print;run;

Regards,

Haikuo

Linlin · Posted 05-10-2012 02:44 PM

another approach:

data one(drop=p1 p2) ;

length name $ 20;
infile cards;
input;
p1=findc(_infile_,'0123456789');
p2=find(_infile_,'$');
name=substr(_infile_,1,p1-1);
age=input(substr(_infile_,p1,p2-1),3.);
salary=input(substr(_infile_,p2+1),dollar11.);
format salary dollar11.;
cards;
John Matthew 40 $90,000
George Wilson 28 $50,000
Robert Nicolas 30 $60,000
Leo Thomas 35 $70,000
;
proc print;run;
Obs name age salary

                              1     John Matthew       40        $90,000
                              2     George Wilson      28        $50,000
                              3     Robert Nicolas     30        $60,000
                              4     Leo Thomas         35        $70,000

FloydNevseta · Posted 05-10-2012 03:00 PM

If the name always consists of first name and last name then read them directly

and concatenate like this:

data one (drop=_:);

infile cards dlm='';

input _fname :$8. _lname :$15. age sal :dollar11.;

name=catx(' ', _fname, _lname);

format sal dollar7.;

cards;

John Matthew 40 $90,000

George Wilson 28 $50,000

Robert Nicolas 30 $60,000

Leo Thomas 35 $70,000

;

run;

But if the name varies because of middle names and initials, then something like this will work:

data one (drop=_:);

infile cards dlm='';

length _partial_nm $15 name $45;

do while ( 1 );

input _partial_nm :$15. @;

if anydigit( _partial_nm ) > 0 then leave; * found age;

name = catx( ' ', name, _partial_nm );

end;

age = input( _partial_nm, 15. );

input sal :dollar11.;

format sal dollar7.;

cards;

John J. Matthew 40 $90,000

George Wilson 28 $50,000

Robert David Nicolas 30 $60,000

Leo Thomas 35 $70,000

;

run;

art297 · Posted 05-10-2012 03:08 PM

or, a modification of Linlin's approach:

data one;

length name $ 20;

infile cards;

input @;

_infile_=catx(" ",substr(_infile_,1,anydigit(_infile_)-1),

substr(_infile_,anydigit(_infile_)));

input name & age salary;

informat salary dollar11.;

cards;

John Matthew 40 $90,000

George Wilson 28 $50,000

Robert Nicolas 30 $60,000

Leo Thomas 35 $70,000

;

MikeZdeb · Posted 05-10-2012 04:08 PM

hi ... the more the merrier ...

data x;

infile datalines dsd dlm='/';

input @;

_infile_ = translate(_infile_,'/',' ');

substr(_infile_ , find(_infile_,'/') , 1) = ' ';

input name : $30. age salary : comma.;

format salary dollar10.;

datalines;

John Matthew 40 $90,000

George Wilson 28 $50,000

Robert Nicolas 30 $60,000

Leo Thomas 35 $70,000

;

Anne · Posted 05-10-2012 04:34 PM

data a;

input (fname lname ) ($) age salary :dollar.;

datalines;

John Matthew 40 $90,000

George Wilson 28 $50,000

Robert Nicolas 30 $60,000

Leo Thomas 35 $70,000

;

data a;

set a;

FullName=trim(fname)||' '||trim(lname);

drop fname lname;

run;

chang_y_chung_hotmail_com · Posted 05-10-2012 06:14 PM

Same idea as Mike's (aka "the _infile_ magic"), but a bit more robust against possible middle names and extra blanks as such. I am pretty sure this is not most "efficient." 🙂

%*-- sub last two occurances of blanks into bars(|) --*;
%let b = \s+;      %*-- blanks     --*;
%let d = \d+;      %*-- digits     --*;
%let nb = [^\s]+; %*-- non-blanks --*;
%let nd = [^\d]+; %*-- non-digits --*;
%let prx = s/^(&nd)(&b)(&d)(&b)(&nb)\s*$/$1|$3|$5/;

data one;
   infile cards dlm="|";
   input @;
   _infile_ = prxchange("&prx", 1, _infile_);
   input name :$30. age inc :comma9.;
cards;
John Q. Matthew 40 $90,000
George Wilson 28    $50,000
Robert Nicolas 30 $60,000
Leo Thomas 35 $70,000
;
run;

/* check */
proc print data=one;
   format inc dollar9.;
run;
/* on lst
Obs    name               age          inc

1     John Q. Matthew     40      $90,000
2     George Wilson       28      $50,000
3     Robert Nicolas      30      $60,000
4     Leo Thomas          35      $70,000
*/

data_null__ · Posted 05-10-2012 06:48 PM

If you know there are always 2 non blank fields that follow you can count backward to find the length of the field and use $VARYING. Using Chang's data.

data one;

infile cards;

input @;

call scan(_infile_,-2,p,l,' ');

l=p-2;

input name $varying30. l age inc :comma9.;

cards;

John Q. Matthew 40 $90,000

George Wilson 28 $50,000

Robert Nicolas 30 $60,000

Leo Thomas 35 $70,000

Hello Kitty . .

;

run;

proc print;

run;

Ksharp · Posted 05-10-2012 11:30 PM

How about:

data x (drop=x:);
infile cards truncover ;
length name $ 40;
input (x1-x3) (:$20.) @;
if anydigit(x3) then do;
                       input +(-1*(length(x3)+1)) age sal : dollar12.;
                       call missing(x3);
                     end;
 else input age sal : dollar12.;
name=catx(' ',of x1-x3);
cards;
John J. Matthew 40 $90,000
George Wilson 28 $50,000
Robert David Nicolas 30 $60,000
Leo Thomas 35 $70,000
Xia Ke Shan 45 $43,432
Hello    Kitty . .
;
run;

Ksharp

Message was edited by: xia keshan

ScottBass · Posted 05-11-2012 02:49 AM

Adding one more reply to the mix...

data one;

length line $100;

infile cards dlm="|";

input line;

cards;

John Q. Matthew 40 $90,000

George Wilson 28 $50,000

Robert Nicolas 30 $60,000

Leo Thomas 35 $70,000

Thurston Howell The 3rd 65 $1,000,000

Thurston Howell The 3 rd 65 $1,000,000

;

run;

data two;

length name age inc $40;

set one;

rx=prxparse("/^(.*?)(\d+)\s+(.*)$/");

if prxmatch(rx,line) then do;

name=prxposn(rx,1,line);

* age=input(prxposn(rx,2,line),best.);

age=prxposn(rx,2,line);

* inc=input(prxposn(rx,3,line),dollar18.);

inc=prxposn(rx,3,line);

inc2=input(inc,dollar18.);

end;

format inc2 dollar18.;

run;

The regular expression in prxparse is:

1) Lazy capture all characters anchored at the start of the line until you encounter one or more digits separated by one or more spaces. That's why the first "Thurston Howell" line worked but the second one didn't.

2) Capture the first encounter of one or more digits followed by one or more spaces. The lazy capture (.*?) ensures you stop at the first encounter of the digits. Since dot (.) matches ANY character, a greedy capture would capture until the last encounter of digit + a space. Probably not an issue with this data, but still, I think a lazy capture is best here.

3) Greedy capture all characters after the digit plus one or more spaces until the end of the line.

I left name, age, and inc as character values for debugging. Once you're happy with the parsing, you can change the data type and nest the prxposn calls inside the input calls.

HTH,

Scott

P.S.: Can someone explain how you embed the SAS formatted code (color, etc.) into these messages?

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

Andre · Posted 05-11-2012 04:15 AM

Scott

copying from the enhanced editor to Word (2010 for me)

and then copy from word to the answer screen as it is.

I think it is working

data two;

length name age inc $40;

set one;

rx=prxparse("/^(.*?)(\d+)\s+(.*)$/");

if prxmatch(rx,line) then do;

name=prxposn(rx,1,line);

* age=input(prxposn(rx,2,line),best.);

age=prxposn(rx,2,line);

* inc=input(prxposn(rx,3,line),dollar18.);

inc=prxposn(rx,3,line);

inc2=input(inc,dollar18.);

end;

format inc2 dollar18.;

run;

HTH

Andre

Registration is open

SAS Training: Just a Click Away