BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
SAShole
Pyrite | Level 9

I'm trying to import clickstream data but i have a field that has multiple commas. i know that the next field always starts with double quotes. maybe there is a scan solution for this?

Here is what i'm using to import:

data Sample                       ;

%let _EFIERR_ = 0; /* set the ERROR detection macro variable */

  infile 'E:\Jeff B\Open\ClickStream\Sample.txt' delimiter = ',' MISSOVER DSD lrecl=32767 ;

  informat Date  date11.;

  informat DateTime time. ;

  informat IP $15. ;

  informat DOM $50. ;

  informat DOM_SUB $255. ;

  informat DEV $255.;

  informat DUR best12.;

  format Date date9. ;

  format DateTime time18.3;

  format IP $15. ;

  format DOM $50. ;

  format DOM_SUB $255. ;

  format DEV $255.;

  format DUR best12.;

  input

  Date

  DateTime

  IP $

  DOM $

  DOM_SUB $

  DEV $

  DUR;

if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */

run;

*please note that record #3 and #7 are not being read in correctly;

Thanks -- Sample data attached.

I've attached the full data set.

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

You seem to have 7 fields.  The 5th one seems to have unprotected commas. The 6th one seems to always start with a space and then a double quote.  The 7th looks like a number.

Here is one way to read it.  You can adjust the variable names, types and lengths to match your actual values.

data sample;

infile '~/temp/sample.txt' termstr=crlf lrecl=10000 dsd truncover col=cc ;

length x1-x7 $2000 ;

input x1-x4 @ ;

start=cc ;

input @' "' @;

end=cc-1;

x5 = substr(_infile_,start,end-start-2);

input @end x6 x7;

if _n_ in (3,7) then put (_n_ x1-x7) (=/);

run;

View solution in original post

4 REPLIES 4
Tom
Super User Tom
Super User

You seem to have 7 fields.  The 5th one seems to have unprotected commas. The 6th one seems to always start with a space and then a double quote.  The 7th looks like a number.

Here is one way to read it.  You can adjust the variable names, types and lengths to match your actual values.

data sample;

infile '~/temp/sample.txt' termstr=crlf lrecl=10000 dsd truncover col=cc ;

length x1-x7 $2000 ;

input x1-x4 @ ;

start=cc ;

input @' "' @;

end=cc-1;

x5 = substr(_infile_,start,end-start-2);

input @end x6 x7;

if _n_ in (3,7) then put (_n_ x1-x7) (=/);

run;

SAShole
Pyrite | Level 9

works great - Thanks tom!

SAShole
Pyrite | Level 9

Tom,

When I try this on the live dataset i'm only able to import the first row. The log file indicates that 'one or more rows might be getting truncated. Here is what the log file looks like:

123  data sample;

124      infile 'E:\Jeff B\Open\ClickStream\Data/centris-11252013.txt'

125      termstr=crlf lrecl=10000 dsd truncover col=cc ;

126      length x1-x7 $2000 ;

127          input x1-x4 @ ;

128              start=cc ;

129              input @' "' @;

130              end=cc-1;

131              x5 = substr(_infile_,start,end-start-2);

132          input @end x6 x7;

133      if _n_ in (3,7) then put (_n_ x1-x7) (=/);

134  run;

NOTE: The infile 'E:\Jeff B\Open\ClickStream\Data/centris-11252013.txt' is:

      Filename=E:\Jeff B\Open\ClickStream\Data\centris-11252013.txt,

      RECFM=V,LRECL=10000,File Size (bytes)=9123388,

      Last Modified=26Nov2013:06:41:42,

      Create Time=26Nov2013:13:34:26

NOTE: 1 record was read from the infile 'E:\Jeff B\Open\ClickStream\Data/centris-11252013.txt'.

      The minimum record length was 10000.

      The maximum record length was 10000.

      One or more lines were truncated.

NOTE: The data set WORK.SAMPLE has 1 observations and 9 variables.

NOTE: DATA statement used (Total process time):

      real time           0.13 seconds

      cpu time            0.14 seconds

any thoughts on what might be causing this?

Thanks!

Tom
Super User Tom
Super User

Most likely the end of line characters are not carriage return and line feed like you coded in your INFILE statement.

You might also need to up the LRECL.

The simple way to check is to read the and let SAS show you want is in the first record.

data _null_;

  infile 'E:\Jeff B\Open\ClickStream\Data/centris-11252013.txt'  termstr=crlf lrecl=100000 ;

  input;

  list ;

  stop;

run;

When I do that on my Unix machine I see that SAS is showing that the last character has hexadecimal code value of '0A'x which is a line feed.

You can eliminate the TERMSTR option or set it to LF and try again.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 2464 views
  • 0 likes
  • 2 in conversation