Need help with reading in a text file

ZoeyElle · Posted 07-24-2011 09:40 PM

Hi everyone!

I have a text file containing data as shown below:

ABC

04885423

04886522

048809876

4666324557677879

5754246789532123

DEF

4321234567897654

5434456434565211

GHI

0466643

089876665

0987643349

ABC

097457738

73638933

03988833

The string 'ABC', 'DEF', and 'GHI' is like a group header.

What I need to do is this:

I need to read in the data and separate them into different datasets based on the header the are under.

For example, when i read the 1st header which is 'ABC' the 5 observations under it will be written to a dataset, say data1.

Then when i hit the 2nd header which is 'DEF' the 2 observations under it will be written to another dataset, say data 2.

Same goes for 3rd header.

When i come to the 4th header, which is 'ABC', the observations to be added to dataset data1 if possible.

Thanks in advance!

art297 · Posted 07-25-2011 01:15 AM

I'm sure there is a more direct way, but the following will do what I think you are trying to accomplish:

data have;

informat thedata $20.;

input thedata;

cards;

ABC

04885423

04886522

048809876

4666324557677879

5754246789532123

DEF

4321234567897654

5434456434565211

GHI

0466643

089876665

0987643349

ABC

097457738

73638933

03988833

;

proc sql noprint;

select distinct thedata

into : distinctfiles separated by " "

from have

where anyalpha(thedata) gt 0

;

select distinct 'if filenm eq "'||strip(thedata)||'" then output '||thedata||";"

into : ifstmt separated by " "

from have

where anyalpha(thedata) gt 0

;

quit;

data &distinctfiles.;

set have;

retain filenm;

if anyalpha(thedata) then filenm=thedata;

else do;

&ifstmt.

end;

run;

ZoeyElle · Posted 07-25-2011 05:18 AM

Hi Art,

Thanks for the reply.

Tried the code suggested. I'm hitting some syntax errors. I think because my actual data is more complicated than what i had posted originally.

1) some of the headers consist of two words, and some have special characters.

2) some of the data(non header) has alphabets.

I hit syntax errors for observations that hit above 2 scenarios.

ABC XYZ

04885423

04886522T

0488G09876

4666324557677879

DEF

4321234567897654

5434456434565211

GHI #CVD

0466643

089876665

0987643349

ABC

097457738

73638933

03988833

Would really appreciate your advice.

Ksharp · Posted 07-25-2011 04:19 AM

It is almost like Art.T 's code.

data have;
  input thedata $20.;
  retain tname;
  if anyalpha(thedata) then do; tname=thedata; delete;end;
  cards;
ABC
04885423
04886522
048809876
4666324557677879
5754246789532123
DEF
4321234567897654
5434456434565211
GHI
0466643
089876665
0987643349
ABC
097457738
73638933
03988833
;
run;
proc sort data=have;by tname;run;
data _null_;
 declare hash ha (hashexp:10);
  ha.definekey('tname','count');
  ha.definedata('thedata');
  ha.definedone();
  do until(last.tname);
   set have;
   by tname;
   count+1;
   ha.add();
  end;
 ha.output(dataset: tname);
run;

Ksharp

ZoeyElle · Posted 07-25-2011 05:39 AM

Hi ksharp,

Hitting some errors as well, if the headers and observations are more complicated as in my 2nd post.

Would appreciate your advice as well.

My apologies for not noticing all the patterns of the data.

art297 · Posted 07-25-2011 07:46 AM

Just add in some conditions to limit what you are looking for and ensure that valid SAS filenames are derived. E.g.:

data have;

informat thedata $20.;

input thedata;

cards;

ABC XYZ

04885423

04886522T

0488G09876

4666324557677879

DEF

4321234567897654

5434456434565211

GHI #CVD

0466643

089876665

0987643349

ABC

097457738

73638933

03988833

;

proc sql noprint;

select distinct compress(thedata,,'kdf')

into : distinctfiles separated by " "

from have

where anyalpha(substr(thedata,1,1)) gt 0

;

select distinct 'if filenm eq "'||strip(compress(thedata,,'kdf'))||

'" then output '||compress(thedata,,'kdf')||";"

into : ifstmt separated by " "

from have

where anyalpha(substr(thedata,1,1)) gt 0

;

quit;

data &distinctfiles.;

set have;

retain filenm;

if anyalpha(substr(thedata,1,1)) then

filenm=compress(thedata,,'kdf');

else do;

&ifstmt.

end;

run;

data_null__ · Posted 07-25-2011 09:06 AM

This can all be done in one step. My TEST for "is it a GROUP row" may not be strict enough for your real data. Also, it looks like the data rows may contain more data than a simple string of numbers. If that is the case the ELSE DO can be change to an INPUT statement to read the fields properly.

filename FT15F001 temp;

data groupinput;

attrib group length=$20.;

retain group;

attrib value length=$64.;

infile FT15F001 truncover;

input @;

if anyalpha(first(_infile_)) then do;

group = _infile_;

delete;

end;

else do;

value = _infile_;

output;

end;

parmcards;

ABC

04885423

04886522

048809876

4666324557677879

5754246789532123

DEF

4321234567897654

5434456434565211

GHI

0466643

089876665

0987643349

ABC

097457738

73638933

03988833

ABC XYZ

04885423

04886522T

0488G09876

4666324557677879

DEF

4321234567897654

5434456434565211

GHI #CVD

0466643

089876665

0987643349

ABC

097457738

73638933

03988833

;;;;

run;

proc print;

run;

art297 · Posted 07-25-2011 09:21 AM

DN, you missed something that the OP specified: "I need to read in the data and separate them into different datasets based on the header the are under."

However, I was hoping that you would look at this as I couldn't remember how to take a character string and apply it to an outfile statement. I got around that by creating a macro variable via proc sql, but I am sure you could come up with something using inputc, putc or the like.

data_null__ · Posted 07-25-2011 09:49 AM

Yes I did miss that part about creating separate data sets.

If the data are grouped properly then the new data sets could be created in one pass using the HASH method posted by ksharp and my imput method. Some of the group names are not proper MEMBER names. OR if not grouped then sort the output from my program by group and use the HASH. Or the code gen method you posted. Both are adequate for the task we don't have enough info to know which would be better depending on the number of groups or total obs etc.

For your question regarding outfile are you referring to the FILE statement option FILEVAR?

art297 · Posted 07-25-2011 09:55 AM

No, I was referring to the following scenario:

data &distinctfiles.;

set have;

retain filenm;

if anyalpha(substr(thedata,1,1)) then

filenm=compress(thedata,,'kdf');

else do;

&ifstmt.

end;

run;

in place of the &ifstmt. that I created in proc sql, how could one specify: output filenm;

data_null__ · Posted 07-25-2011 10:00 AM

You can't. The OUTPUT statement accepts only name(s) not expressions. You need output METHOD to use expressions and the associated HASH bits.

Ksharp · Posted 07-26-2011 12:04 AM

you just need to change the code a little.

data have;
  input thedata $20.;
  retain tname;
  if anyalpha(thedata) eq 1 then do; tname=thedata; delete;end; 
  cards;
   ABC XYZ
04885423
04886522T
0488G09876
4666324557677879
DEF
4321234567897654
5434456434565211
GHI #CVD
0466643
089876665
0987643349
ABC
097457738
73638933
03988833
;
run;

proc sort data=have;by tname;run;
data _null_;
 declare hash ha (hashexp:10);
  ha.definekey('tname','count');
  ha.definedata('thedata');
  ha.definedone();
  do until(last.tname);
   set have;
   by tname;
   count+1;
   ha.add();
  end;
 ha.output(dataset: scan(tname,1)||strip(count));
run;

Ksharp