BookmarkSubscribeRSS Feed
angorwat
Calcite | Level 5

Hi All,

I have an csv file with around 5 fields, one of this field (field 3) is an free text column in the source system.

I.E., user can type an message in it, so this field has carriage return value, special char, quotes, alt+enter.

Also, length of the field 3 is variable, depends on the user message.

So that data of the field 3 is distributed in multiple lines per observation in few rows and in one line per observation in few rows, basically depends on the length of the text.

Fields description

Field 1 – emp id

Field 2  - dept id

Field 3 –  free text, it includes comma or special character, carriage return

Field 4  - Product code

Field 5  - country code

Sample data

001,12A, Tues to Friday John rotates with Team, Perm 10800-1645 Sat
Perm 8am starts Tue to Sat as of 22/02
, cards, UK

002,12B, Mon to Wed Marry rotates with Team, Perm 0800-1645 Sat
Perm 8am starts Tue to Sat as of 22/02

Works in shift, HL, UK

003,12c, Sat&Sun Paul rotates with Team, Perm 19000-1645 Sat,HL, UK

004,12D, All day Joe rotates with PL Team, Perm 10800-1645 Sat
8am starts Tue to Sat as of 24/02

Works in shift, CC, UK

Even I am okay if I exclude the field 3 and read the rest of the fields correctly.

I tried with PROC Import but not able to find an option of excluding the particular column/fields.

Field 3 data highlighted in red colour.

Please suggest an approach to read this csv file.

15 REPLIES 15
HongqiuGu
Calcite | Level 5

It's ok for me .

I think it may related to your SAS Version. BTW, my SAS Version is 9.3 TS1M2.

pradeepalankar
Obsidian | Level 7

try this code:

data output(drop=line);

infile 'infile.csv'  truncover;

input line $1000.;

field1=scan(line,1,',');

field2=scan(line,2,',');

field3=compress(substr(line,index(line,trim(scan(line,2,',')))+length(trim(scan(line,2,',')))+1,

((index(line,trim(scan(line,-2,',')))-1)-(index(line,trim(scan(line,2,',')))+length(trim(scan(line,2,',')))+1))),'"');

field4=scan(line,-2,',');

field5=scan(line,-1,',');

run;

angorwat
Calcite | Level 5

Thanks, I will try this code.

@ Hongqiugu - do u meant the carriage return value works in 9.3 or did used the sample data


btw - can we exclude any specific field/column from the csv file before reading into SAS.???

HongqiuGu
Calcite | Level 5

I used the data you posted.

csv file is ";" separated file, actually it is the same as  txt file. you can choose what fields you want to read in or drop for csv files.but no options available  for excel files as far as I know.

pradeepalankar
Obsidian | Level 7

CSV is comma separated file ',' not semicolon separated.

HongqiuGu
Calcite | Level 5

Yeah,  thanks. You are right. I'm a little confused about the two words 'comma' and 'semicolon'.    I am not a native English speaker.

jaredp
Quartz | Level 8

Just thought I'd add some thoughts to what a few posters have already touched on.

One issue I see is that your example data is not a properly formatted CSV file.  When I create CSV files, I encapsulate my free text fields with double quotations "like this".  This tells most software that everything between the quotes is part of one column, regardless of what it contains.  The only issue is when your free text field itself contains double quotes.  This will cause software to cut off your data prematurely.

One option is to change how the source data is generated.  Instead of using a comma, use a set of characters as your delimiter which are less likely to appear in your free text field, such as #*#*# .  You then split your data based on that delimeter as opposed to splitting on a comma.

You could also try an approach similar to pradeepalankarI personally like this approach.  Get your first two and last two fields.  Everything else is your third field. 

It sounds like you have flexibility for what your final data set will look like.  With the code already suggested, you easily aim for this data set:

Field1    Field2    Field3    Field4    Field5
001       12A       cards     UK        001,12A, Tues to Friday John rotates with Team, Perm 10800-1645 Sat Perm 8am starts Tue to Sat as of 22/02, cards, UK
002       12B       HL         UK        002,12B, Mon to Wed Marry rotates with Team, Perm 0800-1645 Sat Perm 8am starts Tue to Sat as of 22/02 Works in shift, HL, UK
003       12c        HL        UK         003,12c, Sat&Sun Paul rotates with Team, Perm 19000-1645 Sat,HL, UK
004       12D       CC        UK         004,12D, All day Joe rotates with PL Team, Perm 10800-1645 Sat 8am starts Tue to Sat as of 24/02 Works in shift, CC, UK


Amir
PROC Star

Hi Angorwat,

As the original file can contain commas as separators and as data, then any data values containing a comma need to be distinguished by being surround by quotes, for example.

But if, as you say, the data could contain any text, which could include quotes, then perhaps CSV is not the best format for this data to be in. Can it be provided to you in a different format to make things easier?

If it cannot then perhaps look for patterns to distinguish the 3rd field, e.g. is it always preceded and followed by 2 fields separated only with a comma each, although I would not even consider that full proof.

Regards,

Amir.

BrunoMueller
SAS Super FREQ

Hi angorwat

You might be able to use the TERMSTR option to get what you want. If your field 3 also contains the separator character (,) then the value for field 3 is usually in quotes.

See also this discussion

The following code will read a CSV file created in excel as outline by you

data test;
  infile "c:\temp\csv_cr.csv"
   
dsd
   
dlm=","
   
missover
   
termstr=cr
  ;
  input
    emp_id :
$8.
    dept_id :
$8.
    free_text :
$1024.
    Product_code :
$8.
    country_code :
$8.
  ;
run;
Tom
Super User Tom
Super User

You should check how your file is storing end of line markers to see if it will be possible to distinguish the line breaks in the middle of a row of data from the end of the row.  It is easy to see this information using a SAS datastep using the LIST statement.

data _null_;

  infile 'myfile.csv' obs=10;

  input;

  list;

run;

You can also use RECFM=F to see all of the characters including the end of line characters.

data _null_;

  infile 'myfile.csv' recfm=F lrecl=80 obs=10;

  input;

  list;

run;

On a Windows machine the end of line is normally created using the two character sequence CR+LF (carriage return and linefeed) which are hexadecimal codes 0D and 0A, respectively.  If the line breaks in the middle of your data are only coded as one of those (usually it will be the CR or '0D'x) then you can use the TERMSTR option on the INFILE statement.

So you probably want something like this:

data want ;

  infile 'myfile.csv' dsd dlm=',' lrecl=100000 termstr=crlf truncover ;

  length empid $10 deptid $10 product $10 country $3 comment $500 ;

  input empid deptid comment product country;

run;

art297
Opal | Level 21

If your example data actually closely mimic your real data (specifically, if the first two fields are never missing and are always in the form of ### and ##C), then you might be able to use that pattern to reconstruct the records into a form that the datastep can recognize.

Try the following code (but change the 2nd line to reflect the location and name of your actual data):

data want (keep=a b c d e);

  infile "c:\textin.txt" dlm="," dsd truncover end=lastrec;

  informat a b $3.;

  informat c $1024.;

  informat d $8.;

  informat e $8.;

  length stuff holdstuff $1024;

  retain stuff;

  input @;

  if lastrec or (prxmatch('#\d{3},\d{2}[a-zA-Z]#o',_infile_) eq 1 and _n_ gt 1) then do;

    if lastrec then stuff=catt(stuff,_infile_);

    holdstuff=_infile_;

    call scan(stuff,-2,pos,len,",");

    stuff=catt(substr(stuff,1,pos-2),'",',substr(stuff,pos));

    _infile_=stuff;

    input a b c d e;

    output;

      _infile_=holdstuff;

      stuff=catt(substr(_infile_,1,8),'"',substr(_infile_,9));

  end;

  else do;

    if prxmatch('#\d{3},\d{2}[a-zA-Z]#o',_infile_) eq 1 then do;

      stuff=catt(substr(_infile_,1,8),'"',substr(_infile_,9));

    end;

    else stuff=catt(stuff,_infile_);

    input;

  end;

run;

p.s. While i'm sure that he will think the above far more complex than it has to be, thanks to for providing me with the prxmatch function specifics

angorwat
Calcite | Level 5

Thanks all for your responses.

I managed to fix this problem using the code from 26065 - Remove carriage return and linefeed characters within quoted strings.

art297
Opal | Level 21

Glad you solved your problem, but a number of us are not very happy that you wasted our time by not letting us know what your data really looked like.  Your example data didn't include ANY quoted strings.  If it had, obviously, different solutions would have been offered.

angorwat
Calcite | Level 5

Sorry about that, I didn't realized it

Will correct myself, ensure to provide sample data similar to real data next time.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 15 replies
  • 6546 views
  • 5 likes
  • 9 in conversation