Re: Reading data file but only bring in numeric values

jerry898969 · Posted 04-23-2010 10:44 AM

Hello everyone,
I have a file that looks like this
===========================================================
Test Data

Date 3/12/09

97101, 97440, 67431, 55798, 42905, 22590, 31634, 19139, 20920, 33182,
27666, 27532, 25707, 27494

Date 3/16/09
----------------------------------------------

Received 3/14/09

73307,65046,45434,85498,85704,86975,18305,18754,20098,20990,
72034,72389,43949,74336,75148,95609,27495,27686,27689,28710,
89108,69253,59294,69588,81358,91629,31675,31714,31759,33012,
43141,53205

Need to update
===========================================================

How can I just read in the numeric rows and then have them parsed out into a row per number?

This is very urgent so any help would be greatly appreciated.

Thank you

Cynthia_sas · Posted 04-23-2010 11:23 AM

Hi:
I don't understand...does your data file contain
======================================= and
-----------------------------------------------------------------------

and all the white space, or were those separate records that you were trying to indicate???

What about Date 3/12/09 and Date 3/16/09?? Are those IN the file or your descriptions of the data?? Are they separate files? What about the Received 3/14/09??? Again, part of the data or not??

Also, about the data...in your file, is there really a line break between 33182 and 27666 for example, or if you open the file with Notepad, do you see this.
[pre]
97101, 97440, 67431, 55798, 42905, 22590, 31634, 19139, 20920, 33182, 27666, 27532, 25707, 27494
[/pre]

or is there a line break or carriage return at the end of 33182, as shown below:

[pre]
97101, 97440, 67431, 55798, 42905, 22590, 31634, 19139, 20920, 33182,
27666, 27532, 25707, 27494
[/pre]

You say you want one row per number -- by which I think you mean:
[pre]
97101
97440
67431
55798
42905
22590
31634
19139
20920
33182
27666
27532
25707
27494
[/pre]

What would you call that variable Date_3_12_09??? or just NUM or maybe NUMVAR???? Would you want each observation to consist of only one variable or should each observation look like this (partial output):
[pre]
Date Num
3/12/09 97101
3/12/09 97440
3/12/09 67431
[/pre]

You say that you have a "file" is this an ASCII text file (a file that you can open using Notepad)? Is this data in an Excel spreadsheet??

If you have a truly urgent need for help, then you should open a track with Tech Support. In order to help you here, folks really need to see the actual layout of the file you will be reading (with ------ and =====) if those are in the file, but without those cosmetics if they are not in the file.

Can you show us any code that you have already tried??? To learn how to post code and data, refer to this previous forum posting:
http://support.sas.com/forums/thread.jspa?messageID=27609毙

cynthia

jerry898969 · Posted 04-23-2010 12:04 PM

Cynthia,
Thank you for your quick reply. I do apologize for not being clearer.

This is a text file that can be viewed in notepad. It is generated by someone else. My task is to find the rows that start with a numeric character and then pull that row and all following rows till I run out of data then continue on till I find another row starting with numeric data then pull that data out. I don't need to keep any text or dates.
This is exactly what is in the data file. "Need to update" is the last line of the file.

Test Data

Date 3/12/09

97101, 97440, 67431, 55798, 42905, 22590, 31634, 19139, 20920, 33182,
27666, 27532, 25707, 27494

Date 3/16/09
----------------------------------------------

Received 3/14/09

73307,65046,45434,85498,85704,86975,18305,18754,20098,20990,
72034,72389,43949,74336,75148,95609,27495,27686,27689,28710,
89108,69253,59294,69588,81358,91629,31675,31714,31759,33012,
43141,53205

Need to update

This is the code I have so far but this is just testing against a file with just 2 rows of

97101, 97440, 67431, 55798, 42905, 22590, 31634, 19139, 20920, 33182,
27666, 27532, 25707, 27494

[pre]
filename indata "c:\temp\test.txt";
data multi ;
infile indata dlm=',' ;
input num @@ ;
run ;
[/pre]

I wanted to see how I can parse out the data. I got working but I don't know how to check which row is a numeric row or a character row.

Thank you

sbb · Posted 04-23-2010 03:59 PM

For diagnostic review, suggest you add a PUTLOG _ALL_; statement to your program, you would see that the _ERROR_ variable is set to 1 when you cannot perform the input. Then you can perform some other/additional task with a DO/END code piece in your DATA step.

Scott Barry
SBBWorks, Inc.

Report Inappropriate Content · Posted 04-23-2010 04:55 PM

RXPARSE may help you scan a record to see if it has any character values. Here's an example although I had to do two parses to get all the non-numeric rows flagged.

[pre]
data test;
infile 'c:\temp\testfile.txt' dlm=',';
input string :$100. @@;
rx=rxparse("$A");
rx2=rxparse("$'-'");
istext = rxmatch(rx,string) or rxmatch(rx2,string);
run;
[/pre] Message was edited by: 1162

Cynthia_sas · Posted 04-23-2010 05:08 PM

Hi:
I'm sure there's a more elegant way to read this using double trailing at signs (@@) ,but I like this method because it would also give me the opportunity to parse out the dates if I needed them.

First, I put -exactly- these lines into a file, called c:\temp\mixed_data.txt
[pre]
Test Data

Date 3/12/09

97101, 97440, 67431, 55798, 42905, 22590, 31634, 19139, 20920, 33182,
27666, 27532, 25707, 27494

Date 3/16/09
----------------------------------------------

Received 3/14/09

73307,65046,45434,85498,85704,86975,18305,18754,20098,20990,
72034,72389,43949,74336,75148,95609,27495,27686,27689,28710,
89108,69253,59294,69588,81358,91629,31675,31714,31759,33012,
43141,53205

Need to update
[/pre]

Then I use this program to read in 1 line of the file at a time. The INPUT statement uses $VARYING with a length variable in order to read the whole text line and treat each line as a separate entity. (The length variable, LG, comes from using the LENGTH= option on the INFILE statement.) Using $VARYING allows me to read variable length lines -- short lines and long lines without needing to know the exact length of each data line.

I test the first 4 bytes of the line to see whether it is in this list: Date, Need, Rece, Test or ---- or 4 blanks. I also test to see if the line is "empty" -- basically a carriage return without anything else on the line. If the dataline falls into these categories, it is just deleted and control returns to the top of the DATA step to read the next line.

When a dataline doesn't meet any of these conditions, it must be a line with numbers, so the COUNTC function finds out how many commas there are in the line and then the DO loop loops through the line and parses out each number into a character variable. Then a numeric variable is created by using the INPUT function (to turn a character variable into a numeric variable. Each parsed number is output -- so for example, the first data line with numbers contains 10 numbers and so 10 observations are created and the second data line with numbers contains 4 numbers and so 4 observations are created.

Once you understand what the program is doing, I would recommend uncommenting the DROP statement to get rid of unwanted variables. And, of course, you can tweak the program, as needed if you find that there are other stray text lines in the file.

cynthia
[pre]
options ls=120;

data readmixed;
length line $100;
** once you understand what program is doing, then drop unneeded variables;
** DROP line i charvar orig_line_num;
infile 'c:\temp\mixed_data.txt' length=lg;
input @1 line $varying. lg ;
orig_line_num = _n_;
** lg will equal 0 when there is an empty line in the file without any space characters;
if substr(line,1,4) in ('Date', 'Test', 'Rece', 'Need', '----', ' ') or lg=0 then do;
delete;
return;
end;
else do;
num_comma = countc(line,',');
** add 1 because sometimes there is NOT a comma at the end of a line of data;
** if you do an extra scan and there is nothing found, then charvar will be missing or blank;
** and the IF statement will keep that last value from being output.;
do i = 1 to num_comma + 1;
charvar = scan(line,i,',');
numvar = input(charvar,best12.);
if charvar ne ' ' then output;
end;
end;
run;

ods listing;
proc print data=readmixed;
title 'Output Data Set';
run;

[/pre]

Peter_C · Posted 04-26-2010 04:28 AM

*keep it simple
If a line starts with a number, output each number from that line, until there are no more numbers on the line[pre]data numbers ;
infile datalines4 dsd truncover; *comma delimiter by default ;
*get first data item of record/buffer as number ;
input number ?? @ ; *??=avoid invalid data messages, @=hold this buffer ;
do while( number ne . ); *do not output from record when non-numeric is read ;
output ;
input number ?? @ ; * get next number and hold buffer ;
end ;
* with a single trailing @, the infile buffer is released as data step iterates ;
datalines;
Test Data

Date 3/12/09

97101, 97440, 67431, 55798, 42905, 22590, 31634, 19139, 20920, 33182,
27666, 27532, 25707, 27494

Date 3/16/09
----------------------------------------------

Received 3/14/09

73307,65046,45434,85498,85704,86975,18305,18754,20098,20990,
72034,72389,43949,74336,75148,95609,27495,27686,27689,28710,
89108,69253,59294,69588,81358,91629,31675,31714,31759,33012,
43141,53205

Need to update
;;;;

[/pre] truncover added to infile >>> Message was edited by: Peter.C

jerry898969 · Posted 04-26-2010 08:53 AM

Peter,
Your solution seems to be the cleanest for what I need it for.

Thank you and to everyone else's suggestions I pick up a lot of info from everyone posts.

I do appreciate everyone's help.

Cynthia_sas · Posted 04-26-2010 04:04 PM

Peter:
Very elegant!

If the data had come from some of my former customers, however, they would have been back a week later and said, "Hey, you know that file of numbers you gave us, it would be really swell if you could give us the dates that were in that original text file, too." That's how I got into the habit of coding in anticipation of what their next request was going to be. 😉

cynthia

Peter_C · Posted 04-26-2010 05:45 PM

that should require very little more
Before the Datalines statement, insert[pre]
if _infile_ =: 'Date' ;
input @6 set_date : mmddyy. ;
retain set_date ;
attrib set_date format= date9. ;
[/pre] datalines4 statement should have no 4 Message was edited by: Peter.C

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!