BookmarkSubscribeRSS Feed
Scottcom5
Calcite | Level 5

Hi Guys,

I have a CSV which is pipe delimited.and doesn't include any text identifiers (the individual strings are not surrounded by double inverted commas or anything else for that matter).  Unfortunately some of the text strings contain embedded carriage returns that I obviously need to remove in order to read the file in effectively.

I have searched the internet high and low for a solution and attempted to use what I have located without success.  The solution with the most potential seemed to be a datastep with both infile and file statements using the sharebuffers option.  For some unexplained reason this method simply kills my E.G session (which I know sounds crazy, but that is what happens).  The following code is a reasonable representation (although an over simplification) of what how the data would look if I imported the CSV file.

DATA STUFF;

LENGTH TEXT $1000;

INPUT TEXT $;

DATALINES;

VAR1|VAR2|VAR3|VAR4|VAR5|VAR6|VAR7|VAR8|VAR9|VAR10|VAR11|VAR12

1|2|3|4

|5|6|7|8|

9|10|11|12 ;

RUN;

To get around this I thought that I could read the first line of the text file (containing the variable names which never have embedded carriage returns) into a single variable and identify the number of pipes (11).  I would then do the same for the remainder of the file and where the number of pipes are less than 11 then append the next record (by lagging the first occurance) to the second occurance and then delete any records that don't have 11 pipes.  This works perfectly until you come across an instance where there are multiple carriage returns for a single record.

Can anyone offer a suggestion on how I could continue this process until the number of pipes equals 11?

I am open to other solutions, but this is the best I could come up with.

Any help would be greatly appreciated.

Regards,

Scott

10 REPLIES 10
data_null__
Jade | Level 19

You don't mention the OS you are using.  That would be helpful.

I think the TERMSTR INFILE statement option will solve this problem.  Try TERMSTR=CRLF.

If that doesn't work be more specific about the offending character (is is '0A'x?) and your OS.

ballardw
Super User

Depending on how messy your data is you might also try the FLOWOVER option on the INFILE statement. This option attempts to continue to read data from the next line when an unexpected end of line occurs.

Scottcom5
Calcite | Level 5

Hi BallardW,

I did try the Flowover option, but was getting some unexpected results.

Thank you for your suggestion though..

Regards,

Scott

ballardw
Super User

Perhaps you could show an actual copy of the infile and input statements you are using.

My comment about messy data basically meant that if you have empty fields there's likely to be problems keeping the fields synched.

Instead of LAG you might be better off looking at the pointer control input and @ and @@ and parsing data.

It sounds like you have multiple files with different variables from your comment about reading the first line. Will that give you enough information to know whether the variables are text or numeric? Or will you assume all are text and covert any numeric later?


Scottcom5
Calcite | Level 5

Hi data_null_,

I have attempted to use TERMSTR but with no joy.  From the documentation I have read this is only useful when processing a file created in Unix using Windows (and vice versa) as it specifies the end-of-line character for the specified file, which is a CR ('0D'x in this example.

We are using Windows Server 2008 R2 if that helps.

Regards,

Scott

data_null__
Jade | Level 19

Attache the file or a few records.

Scottcom5
Calcite | Level 5

Sorry, I can't provide a sample as the data is confidential.

data_null__
Jade | Level 19

I cannot reproduce your problem with imbeded CR '0D'x in the records. Using SAS 9.2 for winders.  I can produce the problem with LF '0A'x and it can be fixed with TERMSTR.  If you are using SAS 9.1 I recall that TERMSTR did not work.

1981  filename FT44F001 temp;

1982  data _null_;

1983     file FT44F001;

1984     put 'VAR1|VAR2|VAR3|VAR4|VAR5|VAR6|VAR7|VAR8|VAR9|VAR10|VAR11|VAR12';

1985     put '1|2|3|4' '0a'x '|5|6|7|8|' '0a'x '9|10|11|12';

1986     put '1|2|3|4|5|6|7|8|9|10|11|12';

1987     put '1|2|3|4' '0a'x '|5|6|7|8|' '0a'x '9|10|11|12';

1988     run;

NOTE: The file FT44F001 is:

      (system-specific pathname),

      (system-specific file attributes)

NOTE: 4 records were written to the file (system-specific pathname).

      The minimum record length was 26.

      The maximum record length was 62.

NOTE: DATA statement used (Total process time):

      real time           0.06 seconds

      cpu time            0.01 seconds

1989  data _null_;

1990     infile FT44F001;

1991     input;

1992     list;

1993     run;

NOTE: The infile FT44F001 is:

      (system-specific pathname),

      (system-specific file attributes)

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0

1         VAR1|VAR2|VAR3|VAR4|VAR5|VAR6|VAR7|VAR8|VAR9|VAR10|VAR11|VAR12 62

2         1|2|3|4 7

3         |5|6|7|8| 9

4         9|10|11|12 10

5         1|2|3|4|5|6|7|8|9|10|11|12 26

6         1|2|3|4 7

7         |5|6|7|8| 9

8         9|10|11|12 10

NOTE: 8 records were read from the infile (system-specific pathname).

      The minimum record length was 7.

      The maximum record length was 62.

NOTE: DATA statement used (Total process time):

      real time           0.01 seconds

      cpu time            0.01 seconds

1994

1995  data _null_;

1996     infile FT44F001 termstr=crlf;

1997     input;

1998     list;

1999     run;

NOTE: The infile FT44F001 is:

      (system-specific pathname),

      (system-specific file attributes)

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0

1         VAR1|VAR2|VAR3|VAR4|VAR5|VAR6|VAR7|VAR8|VAR9|VAR10|VAR11|VAR12 62

2   CHAR  1|2|3|4.|5|6|7|8|.9|10|11|12 28

    ZONE  3737373073737373703733733733

    NUMR  1C2C3C4AC5C6C7C8CA9C10C11C12

3         1|2|3|4|5|6|7|8|9|10|11|12 26

4   CHAR  1|2|3|4.|5|6|7|8|.9|10|11|12 28

    ZONE  3737373073737373703733733733

    NUMR  1C2C3C4AC5C6C7C8CA9C10C11C12

NOTE: 4 records were read from the infile (system-specific pathname).

      The minimum record length was 26.

      The maximum record length was 62.

NOTE: DATA statement used (Total process time):

      real time           0.04 seconds

      cpu time            0.01 seconds

Ksharp
Super User

Or Can you make some dummy data to appear your problem again.

And Best is posting the output you want too.

Oleg_L
Obsidian | Level 7

This may be helpful:

https:/communities.sas.com/thread/34015

Oleg

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 3572 views
  • 6 likes
  • 5 in conversation