Re: Reading from a single Long Line @@

athapa1183 · Posted 08-29-2018 09:45 AM

The data is very long and just a single row. I just have 4 segments from the long line.

CLM*178*322.56***12+B+1*Y*A*Y*Y**NTE*19*94289*WER~CLM*123 *322.56***11+A+1*Y*A*Y*Y* *90289*RRW~CLM*223 *322.56***11+A+1*Y*A*Y*Y* NTE*11*93289*RRW~CLM*120 *322.56***11+A+1*Y*A*Y*Y* NTE*17*91289*SEW~

I want

CLM	NTE
178	19
123	Blank
223	11
120	17

I have almost 15,000 CLM in the long line.

Sometimes the NTE segment is itself missing and it messes up my output.

Here is my code :

data a_111;
infile 'E:\Users\Documents and Settings\57260\Desktop\test\.W.160712102723.086.dat' lrecl = 1000000000 ;
input @'CLM*' string $3. @'NTE*' COS $2. @@;
run;

how to account for when NTE segment is missing.

I have attached a small sample of data

I will appreciate your help in this regard.

RW9 · Posted 08-29-2018 10:05 AM

Post test data in the form of a datastep!!

As such I am once again guessing:

data want;
  set have;
  do i=1 to countw(your_string,"CLM");
    new_line=scan(your_string,i,"CLM");
    clm=scan(new_line,1,"*");
    output;
  end;
run;

You can then use the same sort of scan() function to split out parts of the remaining string.

athapa1183 · Posted 08-29-2018 10:18 AM

THis will not take care of missing NTE Segment.

Thank YOu

RW9 · Posted 08-29-2018 10:19 AM

And this is of course why my final line was:

"You can then use the same sort of scan() function to split out parts of the remaining string."

You can work out how to do that part from the code I have provided?

athapa1183 · Posted 08-29-2018 10:25 AM

Thank YOu RW9.

the file is a dat file with only 1 long single line. Not a dataset yet.

RW9 · Posted 08-29-2018 10:30 AM

You need to know what the file structure is then. I assume that it is * delimited, so that would be step one:

data want;
  infile "...dat" dlm="*" missover dsd;
  input ...;
run;

From there you would add to the input statement the structure of the file you are reading. In this way you would then end up with the elements in variables.

This: lrecl = 1000000000

Is not going to work however:

http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a003160472.htm

There is a maximum to the length. It would be simpler to get the file with line breaks (perhaps it already does, maybe you need to change the termstr:

http://support.sas.com/kb/14/178.html

I.e. did the file come from unix and you are using on windows perhaps or vice versa, you can use notepad++ or one of the text editors to look and see).

athapa1183 · Posted 08-29-2018 10:39 AM

I tried all of it. Does not work with @@.

dlm="*" missover dsd

RW9 · Posted 08-29-2018 10:49 AM

No, you do not use @@ with that line. The file you get, is it from unix, is it from windows. These have different line endings, one has line feed + carriage return, the other has just carriage return. From the documentation:

http://support.sas.com/kb/14/178.html

Something like:

  infile "...dat" dlm="*" missover dsd termstr=lf;

It is very unlikely that your file is one long string.

If however it is, then return to source and get a better file.

The only other solution is to try to go down the route of reading it block at a time, then parsing out each individual element, which will be a real pain for you. then you would need the @@:

data inter;
  infile "...dat";
  input long_string $ @@;
  /* do lots of processing here to split data out, work out if you have gone over a block etc. */
run;

Even then its probably not going to work due to line length longer than possible. I would really recommend checking out the termstr option, or trying to get a better file before you try that route however.

athapa1183 · Posted 08-29-2018 10:52 AM

it is in windows.

it is really a long single line separated by ~

if you see the 4 segments in the attached text file,i posted they all end in ~

but they are together in 1 long line

Thank You

ballardw · Posted 08-29-2018 11:20 AM

And what documentation accompanied this file as to content, layout or value types? For instance, how do you know that CLM is always three characters? Or NTE is two characters when present? These lengths are explicit from your input statement.

Also, how do you know that there are "almost 15,000 CLM "? Did you manually count them (doubtful)?

If anyone expected me to use such a file without some description I would likely have someone whipped with wet spaghetti noodles until some sort of clarification was offered.

It appears that perhaps the ~ character is some sort of record or field separator.

If that is the case 1) extract one segment using

2) parse that segment for the CLM and NTE

3) output the values

4) repeat for the next segment.

athapa1183 · Posted 08-29-2018 11:56 AM

hi BALLARDW,

i made it simple.

Yes the CLM is always 3 characters and NTE is 2

i opened the file in notepad++ to get the counts

Even if i do that approach

then how to account for missing NTE segments.

FreelanceReinh · Posted 08-29-2018 12:31 PM

Hi @athapa1183,

Try this: Add the options

termstr='~' missover

to your INFILE statement and delete the "@@" from your INPUT statement.

With your test data a1.txt it worked on my Windows machine. Of course, this assumes that "~" is the record separator.

Edit: The TRUNCOVER option would be safer than MISSOVER if NTE had some shorter values (i.e. length 1) which occurred at the end of a record (i.e. before the "~" character), but you wrote that this is not the case.

athapa1183 · Posted 08-29-2018 02:00 PM

Hi FreelanceReinhard

Thank You for the suggestion.

Yes i did see that it works. I thought this simple piece will work in my big dataset.

Now i am attaching a portion of my big dataset.I have changed it little though.

How to handle missing value when the NTE segment is itself missing

Need the highlighted parts

CLM*15176002400002*18104.49***11+A+1*

NTE*ADD*11899~

AMT*D*1260~

My code

data a_111;
infile 'E:\Users\Documents and Settings\57260\Desktop\s1.TXT' termstr='~' missover lrecl = 1000000000 ;
input @'CLM*' string $21. @'***' +5 FREQ $1. @'NTE*TPO*' COS $5. @'AMT*D*' Paid_Amt1 $6. ;

RUN;

i WANT :

CLM	FRQ	NTE	AMT
15176002400002	1	11899	1260
18199000180002	1	87599	8377
12345678400002	1		1234
22121000180002	1		21

I really appreciate your help

Thank You

ballardw · Posted 08-29-2018 02:25 PM

@athapa1183 wrote:

Hi FreelanceReinhard

Thank You for the suggestion.

Yes i did see that it works. I thought this simple piece will work in my big dataset.

Now i am attaching a portion of my big dataset.I have changed it little though.

How to handle missing value when the NTE segment is itself missing

Need the highlighted parts

CLM*15176002400002*18104.49***11+A+1*

NTE*ADD*11899~

AMT*D*1260~

And now we are back to once again asking: where is the document that describes the data file.

The file you just posted does not have the ~ as a possible record separator the way the earlier example did.

Or there are multiple records that have missing CLM and NTE.

For those interested, if we consider the ~ as a record separator from the previous context then this what the latest example file looks like for a bit of the file:

CLM*15176002400002*18104.49***11+A+1**A*Y*Y~
DCVS*434*RD8*3543453453~
DTP*435*DT*21106091700~
DTP*096*TM*1900~
CL1*1*1*01~
AMT*F3*0~
REF*D9*050GXLO5~
REF*EA*521220~
NTE*ADD*11899~
HI*BK+11214~HI*BJ+47774~
HI*BF+SDFAS*BF+30000+++++++Y*BF+34580+++++++Y*BF+4659+++++++Y*BF+30401+++++++Y*BF+4019+++++++Y*BF+4928+++++++Y~
HI*DR+0000~
HI*BQ+455+D8+3434342*BQ+9921+D8+201232322*BQ+9232+D8+11111*BQ+9390+D8+20150609~
HI*BR+9394+D8+20150609~
HI*BH+11+D8+20150609*BH+A1+D8+19460616*BH+B1+D8+19460616~
HI*BE+01+++623*BE+80+++3*BE+A3+++18104.49~
HCP*02*9636.72****0191~
NM1*77*2*GFHJSW3*****XX*345345345~
N3*DDASWFQQ~
N4*SDFASDF*WERWER*100879998~
SBR*P*18*DUAL***6***MC~
AMT*D*1260~
NM1*IL*1*JJGHGWGF*J***MI*ZZ53898X~
N3*4808 GASDFGSDFG~
N4*SGREGRGTER*11235~
NM1*PR*2*AREGTERTGERGTAW*****PI*40024~
DTP*573*D8*725572~
REF*F8*2452452452~
SBR*S*18*DFGDFG300***1***MB~

Which bears almost no resemblance to the previous "example" other than being one long undocumented mess.

The previous example had every occurrence of CLM (except the beginning of the line) as immediately after the ~ and when the NTE value was present it came before the following ~.

And what is the rule to find the value of 1 out of "11+A+1"

athapa1183 · Posted 08-29-2018 03:39 PM

the files are just like the sample data.

The CLM are not missing.

Only some NTE segments are missing.

~ yes this is record separator but you can use any separator you like to get the results.

what is the rule to find the value of 1 out of "11+A+1"

ANS: there is no rule we just need that 1 from "11+A+1"

Thank You

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!