Hi,
Looking for the most efficient method to read this data:
John Matthew 40 $90,000
George Wilson 28 $50,000
Robert Nicolas 30 $60,000
Leo Thomas 35 $70,000
This unaligned data represents name, age and salary variables. The data values are separated by a single space and the variable "name" contains an embedded blank.
Thanks,
Divya
If you know there are always 2 non blank fields that follow you can count backward to find the length of the field and use $VARYING. Using Chang's data.
data one;
infile cards;
input @;
call scan(_infile_,-2,p,l,' ');
l=p-2;
input name $varying30. l age inc :comma9.;
cards;
John Q. Matthew 40 $90,000
George Wilson 28 $50,000
Robert Nicolas 30 $60,000
Leo Thomas 35 $70,000
Hello Kitty . .
;
run;
proc print;
run;
Hi,
Try this..
data one;
infile cards dlm='';
input name $ 1-15 age 16-18 sal :dollar.;
format sal dollar7.;
cards;
John Matthew 40 $90,000
George Wilson 28 $50,000
Robert Nicolas 30 $60,000
Leo Thomas 35 $70,000
;
run;
Thanks,
Shiva
Hi shiva,
Thanks for the response. But the problem I am facing here is name and age data values are separated by a single space and its a long text file so I cannot manually insert spaces between the two and make age values start from the 16th column.
If that was the case, I would have used ampersand format modifier to read it, but here I cannot.
Is there any other way of doing it?
Divya
Hi Divya,
I don't have time now. It may helpful to look at this discussion:
This work for your data as is, you can modify it to fit more complex situations hopefully:
data one (drop=_:);
infile cards dlm='';
input @;
name=catx(' ',scan(_infile_,1), scan(_infile_,2));;
input _1 _2 age sal:dollar11.;
format sal dollar7.;
cards;
John Matthew 40 $90,000
George Wilson 28 $50,000
Robert Nicolas 30 $60,000
Leo Thomas 35 $70,000
;
proc print;run;
Regards,
Haikuo
another approach:
data one(drop=p1 p2) ;
length name $ 20;
infile cards;
input;
p1=findc(_infile_,'0123456789');
p2=find(_infile_,'$');
name=substr(_infile_,1,p1-1);
age=input(substr(_infile_,p1,p2-1),3.);
salary=input(substr(_infile_,p2+1),dollar11.);
format salary dollar11.;
cards;
John Matthew 40 $90,000
George Wilson 28 $50,000
Robert Nicolas 30 $60,000
Leo Thomas 35 $70,000
;
proc print;run;
Obs name age salary
1 John Matthew 40 $90,000
2 George Wilson 28 $50,000
3 Robert Nicolas 30 $60,000
4 Leo Thomas 35 $70,000
If the name always consists of first name and last name then read them directly
and concatenate like this:
data one (drop=_:);
infile cards dlm='';
input _fname :$8. _lname :$15. age sal :dollar11.;
name=catx(' ', _fname, _lname);
format sal dollar7.;
cards;
John Matthew 40 $90,000
George Wilson 28 $50,000
Robert Nicolas 30 $60,000
Leo Thomas 35 $70,000
;
run;
But if the name varies because of middle names and initials, then something like this will work:
data one (drop=_:);
infile cards dlm='';
length _partial_nm $15 name $45;
do while ( 1 );
input _partial_nm :$15. @;
if anydigit( _partial_nm ) > 0 then leave; * found age;
name = catx( ' ', name, _partial_nm );
end;
age = input( _partial_nm, 15. );
input sal :dollar11.;
format sal dollar7.;
cards;
John J. Matthew 40 $90,000
George Wilson 28 $50,000
Robert David Nicolas 30 $60,000
Leo Thomas 35 $70,000
;
run;
or, a modification of Linlin's approach:
data one;
length name $ 20;
infile cards;
input @;
_infile_=catx(" ",substr(_infile_,1,anydigit(_infile_)-1),
substr(_infile_,anydigit(_infile_)));
input name & age salary;
informat salary dollar11.;
cards;
John Matthew 40 $90,000
George Wilson 28 $50,000
Robert Nicolas 30 $60,000
Leo Thomas 35 $70,000
;
hi ... the more the merrier ...
data x;
infile datalines dsd dlm='/';
input @;
_infile_ = translate(_infile_,'/',' ');
substr(_infile_ , find(_infile_,'/') , 1) = ' ';
input name : $30. age salary : comma.;
format salary dollar10.;
datalines;
John Matthew 40 $90,000
George Wilson 28 $50,000
Robert Nicolas 30 $60,000
Leo Thomas 35 $70,000
;
data a;
input (fname lname ) ($) age salary :dollar.;
datalines;
John Matthew 40 $90,000
George Wilson 28 $50,000
Robert Nicolas 30 $60,000
Leo Thomas 35 $70,000
;
data a;
set a;
FullName=trim(fname)||' '||trim(lname);
drop fname lname;
run;
Same idea as Mike's (aka "the _infile_ magic"), but a bit more robust against possible middle names and extra blanks as such. I am pretty sure this is not most "efficient." 🙂
%*-- sub last two occurances of blanks into bars(|) --*;
%let b = \s+; %*-- blanks --*;
%let d = \d+; %*-- digits --*;
%let nb = [^\s]+; %*-- non-blanks --*;
%let nd = [^\d]+; %*-- non-digits --*;
%let prx = s/^(&nd)(&b)(&d)(&b)(&nb)\s*$/$1|$3|$5/;
data one;
infile cards dlm="|";
input @;
_infile_ = prxchange("&prx", 1, _infile_);
input name :$30. age inc :comma9.;
cards;
John Q. Matthew 40 $90,000
George Wilson 28 $50,000
Robert Nicolas 30 $60,000
Leo Thomas 35 $70,000
;
run;
/* check */
proc print data=one;
format inc dollar9.;
run;
/* on lst
Obs name age inc
1 John Q. Matthew 40 $90,000
2 George Wilson 28 $50,000
3 Robert Nicolas 30 $60,000
4 Leo Thomas 35 $70,000
*/
If you know there are always 2 non blank fields that follow you can count backward to find the length of the field and use $VARYING. Using Chang's data.
data one;
infile cards;
input @;
call scan(_infile_,-2,p,l,' ');
l=p-2;
input name $varying30. l age inc :comma9.;
cards;
John Q. Matthew 40 $90,000
George Wilson 28 $50,000
Robert Nicolas 30 $60,000
Leo Thomas 35 $70,000
Hello Kitty . .
;
run;
proc print;
run;
How about:
data x (drop=x:); infile cards truncover ; length name $ 40; input (x1-x3) (:$20.) @; if anydigit(x3) then do; input +(-1*(length(x3)+1)) age sal : dollar12.; call missing(x3); end; else input age sal : dollar12.; name=catx(' ',of x1-x3); cards; John J. Matthew 40 $90,000 George Wilson 28 $50,000 Robert David Nicolas 30 $60,000 Leo Thomas 35 $70,000 Xia Ke Shan 45 $43,432 Hello Kitty . . ; run;
Ksharp
Message was edited by: xia keshan
Adding one more reply to the mix...
data one;
length line $100;
infile cards dlm="|";
input line;
cards;
John Q. Matthew 40 $90,000
George Wilson 28 $50,000
Robert Nicolas 30 $60,000
Leo Thomas 35 $70,000
Thurston Howell The 3rd 65 $1,000,000
Thurston Howell The 3 rd 65 $1,000,000
;
run;
data two;
length name age inc $40;
set one;
rx=prxparse("/^(.*?)(\d+)\s+(.*)$/");
if prxmatch(rx,line) then do;
name=prxposn(rx,1,line);
* age=input(prxposn(rx,2,line),best.);
age=prxposn(rx,2,line);
* inc=input(prxposn(rx,3,line),dollar18.);
inc=prxposn(rx,3,line);
inc2=input(inc,dollar18.);
end;
format inc2 dollar18.;
run;
The regular expression in prxparse is:
1) Lazy capture all characters anchored at the start of the line until you encounter one or more digits separated by one or more spaces. That's why the first "Thurston Howell" line worked but the second one didn't.
2) Capture the first encounter of one or more digits followed by one or more spaces. The lazy capture (.*?) ensures you stop at the first encounter of the digits. Since dot (.) matches ANY character, a greedy capture would capture until the last encounter of digit + a space. Probably not an issue with this data, but still, I think a lazy capture is best here.
3) Greedy capture all characters after the digit plus one or more spaces until the end of the line.
I left name, age, and inc as character values for debugging. Once you're happy with the parsing, you can change the data type and nest the prxposn calls inside the input calls.
HTH,
Scott
P.S.: Can someone explain how you embed the SAS formatted code (color, etc.) into these messages?
Scott
copying from the enhanced editor to Word (2010 for me)
and then copy from word to the answer screen as it is.
I think it is working
data two;
length name age inc $40;
set one;
rx=prxparse("/^(.*?)(\d+)\s+(.*)$/");
if prxmatch(rx,line) then do;
name=prxposn(rx,1,line);
* age=input(prxposn(rx,2,line),best.);
age=prxposn(rx,2,line);
* inc=input(prxposn(rx,3,line),dollar18.);
inc=prxposn(rx,3,line);
inc2=input(inc,dollar18.);
end;
format inc2 dollar18.;
run;
HTH
Andre
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.