BookmarkSubscribeRSS Feed
Darpnew
Calcite | Level 5

Dear Folks,

I am new to SAS and facing difficulty to understand the below point:

The below program is not giving correct output because the data after datalines statement is not following the spacing that is required as per variables defination in 'input' statement. Here, I have only 5 record lines and thus it is easy for me to change the spacing in datalines. But, if I have millions of records and data is coming as in current form (where the 'name' variable is having space in between, how would i define my 'input' statement so that output is correct as required. I tried using colon (:) in the input statement, but couldn't get the desired result. Please help!!

options pagesize=60 linesize=80 pageno=1 nodate;

data sales;

input EmployeeID $ 1-9 Name $ 11-29 @30 HireDate date9.

Salary HomePhone $;

format HireDate date9.;

datalines;

429685482 Martin, Virginia 09aug1990 34800 493-0824

244967839 Singleton, MaryAnn 24apr1995 27900 929-2623

996740216 Leighton, Maurice 16dec1993 32600 933-6908

675443925 Freuler, Carl 15feb1998 29900 493-3993

845729308 Cage, Merce 19oct1992 39800 286-0519

;

proc print data=sales;

title ’Sales Department Employees’;

run;

7 REPLIES 7
Haikuo
Onyx | Level 15

I suppose you have a 'blank' in front of the second name, otherwise it shouldn't be a problem for modified list input (:) plus dlm=' '. There are ways to play with _infile_ variable, but it seems to me a straightforward approach will be reading in 'name' as two variables, and then concatenating them, if needed.

data sales;

infile datalines truncover dlm=', ';

input EmployeeID :$9. LName :$20. FName :$20. HireDate :date9. Salary HomePhone :$20.;

format HireDate date9.;

datalines;

429685482 Martin, Virginia 09aug1990 34800 493-0824

244967839 Singleton, MaryAnn 24apr1995 27900 929-2623

996740216 Leighton, Maurice 16dec1993 32600 933-6908

675443925 Freuler, Carl 15feb1998 29900 493-3993

845729308 Cage, Merce 19oct1992 39800 286-0519

;

ballardw
Super User

Since you have "millions of records" the Datalines is not how you should read the data.

What is the file format of the original data? Are there quotes around text values? Or are the columns fixed width?

If you copy data out of some formats and paste into the editor then the editor is making some default decisions on how to paste and the "copy" is making choices on what is placed on the clipboard. and may be losing information in the original data set.

For instance this is the result of copying a line of text from a CSV file opened in Excel and pasted into the SAS editor:

4329 603 1 6/21/1990 83202 1 13 4/9/2015 4 2 3 2

However in the CSV file we have when opened in Wordpad and pasted:

4329,603,1,6/21/1990,83202,1,"1, 3",4/9/2015,4,2,3,2,

Reading the later from a file with a comma delimiter the text of your values with spaces will most likely have quotes around them and not be a problem to read. Similar options involve other types of files.

Datalines also restricts the options that can be used to read the data.

Ksharp
Super User

Is the name always two words ? or one word ?


data sales;
infile datalines truncover ;
input EmployeeID : $20. temp $100.;
name=substr(temp,1,anydigit(temp)-1);
HireDate=scan(temp,-3,' ');
Salary=scan(temp,-2,' ');
HomePhone=scan(temp,-1,' ');
drop temp;
datalines;
429685482 Martin, Virginia 09aug1990 34800 493-0824
244967839 Singleton, MaryAnn 24apr1995 27900 929-2623
996740216 Leighton, Maurice 16dec1993 32600 933-6908
675443925 Freuler, Carl 15feb1998 29900 493-3993
845729308 Cage, Merce 19oct1992 39800 286-0519
;

Xia Keshan

Darpnew
Calcite | Level 5

Hi Xia,

Thanks , appreciate your willingness to guide me.

The name can be 1, 2 or even three characters with spaces / and / or comma in between..

Regards

data_null__
Jade | Level 19

Based on your example it would be logical given 3 fields following NAME to SCAN from the end of _INFILE_ to find the beginning of HireDate.  With this you can calculate the length of NAME and read it with $VARYING.

data Emp;
   infile datalines col=col;
   informat employeeid $12. Name $varying64. HireDate date9. salary comma12. phone $8.;
  
input EmployeeID @;
   call scan(_infile_,-3,s,e,' ');
   l = s-col;
  
input Name $varying64. l HireDate Salary Phone;
   format HireDate date11. salary dollar12.;
  
drop l s e;
   datalines;
429685482 Martin, Virginia 09aug1990 34800 493-0824
244967839 Singleton, MaryAnn 24apr1995 27900 929-2623
996740216 Leighton, Maurice 16dec1993 32600 933-6908
675443925 Freuler, Carl 15feb1998 29900 493-3993
845729308 Cage, Merce 19oct1992 39800 286-0519
;;;;
   run;
proc print;
  
run;
Haikuo
Onyx | Level 15

Similar to 's approach,

data sales;

     infile datalines truncover;

     input @;

     _patternID = prxparse('/\d\D+\d/io');

     call prxsubstr(_patternID, _infile_, _position, _len);

     _len=_len-3;

     input EmployeeID $9.

           @11 Name $varying30. _len

           @(12+_len)

        HireDate :date9.

        Salary

        HomePhone $;

     drop _:;

     format HireDate date9.;

     datalines;

429685482 Martin, KM. Virginia 09aug1990 34800 493-0824

244967839 Singleton, G MaryAnn 24apr1995 27900 929-2623

996740216 Leighton, Maurice 16dec1993 32600 933-6908

675443925 Freuler, Mike Carl 15feb1998 29900 493-3993

845729308 Cage, Merce 19oct1992 39800 286-0519

;

zwubbena
Calcite | Level 5

Problem

The original code from the question:

data sales;
input EmployeeID $ 1-9 Name $ 11-29 @30 HireDate date9. Salary HomePhone $;
format HireDate date9.;
datalines;
429685482 Martin, Virginia 09aug1990 34800 493-0824
244967839 Singleton, MaryAnn 24apr1995 27900 929-2623
996740216 Leighton, Maurice 16dec1993 32600 933-6908
675443925 Freuler, Carl 15feb1998 29900 493-3993
845729308 Cage, Merce 19oct1992 39800 286-0519
;
proc print data=sales; run;

This code produces the following incorrect output:
Annotation 2022-11-22 214000.png

 

Solution

The correct code for the 'name' var defines the input with an "&" and requires rows to have two spaces between columns. SAS reads in the next variable when the space between columns is more than two (>2):

data sales1;
input EmployeeID $9. Name & $29. HireDate date9. Salary HomePhone $;
format HireDate date9.;
datalines;
429685482  Martin, Virginia  09aug1990  34800  493-0824
244967839  Singleton, MaryAnn  24apr1995  27900  929-2623
996740216  Leighton, Maurice  16dec1993  32600  933-6908
675443925  Freuler, Carl  15feb1998  29900  493-3993
845729308  Cage, Merce  19oct1992  39800  286-0519
;
proc print data=sales1; run;

The following output shows what I think you're looking for:

c.png

For more information, see Method 3 here: HOW DO I READ IN A CHARACTER VARIABLE WITH VARYING LENGTH IN A SPACE DELIMITED DATASET? | SAS FAQ 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 23752 views
  • 5 likes
  • 6 in conversation