BookmarkSubscribeRSS Feed
hinear
Fluorite | Level 6

Hi,

 

I got some problem when I am trying to import the .d data. The data contains a very long characters. Although it is covered by open and close quotation mark, SAS is not able to recognise it.

 

For example: Content of the data (test.d)

"Row1" "A AAAAAAAAAAAAAAAAAAAAAAAAA

BBBBBBBBBBBC

CCCCCCCCCCCCCCCC"

 

"Row2"  "A AAAAAAAAAAAAAAAAAAAAAAAAA

BBBBBBBBBBBC

CCCCCCCCCCCCCCCC"

 

 

 

 

My SAS code :-

 

data test;

%let _EFIERR_ = 0;

infile "&ImportLocation.test.d" delimiter=' ' MISSOVER DSD lrecl=32767;

informat

 Var1 $ 4.

 Var2 $ 999.;

format

Var1 $ 4.

Var2 $ 999.;

input

 Var1 $

Var2 $ ;

run;

 

 

 

Output  :-

Var1            Var2

Row1           "A

BBBB

CCCC

 

Row2          "A

BBBB

CCCC

 

 

It is not able to read the full content inside Variables2, the open and close quotation mark is not working.

 

How can I adjust my SAS code to read the data in variables2?

 

Thz

16 REPLIES 16
Kurt_Bremser
Super User

When reading from text files, the line separator takes precedence over the quotes that denote columns. I solved the problem by writing a small C filter program that replaces linefeeds with a <BR> tag everywhere the count of quotes is uneven.

But I think there is a solution with using recfm=n and reading the instream column by column.

hinear
Fluorite | Level 6
thz for reply, can you show me how to write a small C filter program please?
Kurt_Bremser
Super User

This is the source of the code:

#include <stdio.h>
#include <unistd.h>

main (argc,argv)
int argc;
char *argv[];
{
        FILE *ifp;
        FILE *ofp;
        int c;
        int flag;
        int iflag = 0;
        int oflag = 0;
        char *infile = NULL;
        char *outfile = NULL;
        opterr = 0;
        while ((c = getopt(argc,argv,"i:o:")) != -1)
                switch (c) {
                        case 'i':
                                iflag = 1;
                                infile = optarg;
                                break;
                        case 'o':
                                oflag = 1;
                                outfile = optarg;
                                break;
                        case '?':
                                if (optopt == 'c')
                                fprintf (stderr, "Option -%c requires an argumen
t.\n", optopt);
                                else if (isprint (optopt))
                                fprintf (stderr, "Unknown option `-%c'.\n", opto
pt);
                                else
                                fprintf (stderr,
                                "Unknown option character `\\x%x'.\n",
                                optopt);
                                return 1;
                }
        if (iflag == 1) {
                ifp = fopen(infile,"r");
                if (ifp == NULL) {
                        fprintf(stderr, "could not open infile %s\n",infile);
                        return 2;
                }
        }
        else {
                ifp = stdin;
        }
        if (oflag == 1) {
                ofp = fopen(outfile,"w");
                if (ofp == NULL) {
                        fprintf(stderr, "could not open outfile %s\n",outfile);
                        return 3;
                }
                else {
                        fprintf(stdout,"converting %s to %s\n",infile,outfile);
                }
        }
        else {
                ofp = stdout;
        }

        flag = 0;
        while ((c = fgetc(ifp)) != EOF ) {
                if (c == 39) {
                        flag = abs(flag - 1);
                }
                if ((c == 10) && (flag == 1)) {
                        fputc('<',ofp);
                        fputc('B',ofp);
                        fputc('R',ofp);
                        fputc('>',ofp);
                }
                else {
                        fputc(c,ofp);
                }
        }
}

It uses the standard C function getopt() to retrieve the commandline arguments (infile and outfile names); if no name is supplied, stdin and/or stdout are used, respectively.

The main program starts at flag = 0; and just loops until all bytes are read.

You need to have gcc and the basic C libraries installed to compile the code. You could also port the logic to any other programming lagnuage of your choice. The nice thing about C is the sheer speed of the executable.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Just to add, its a good idea to name the file extension to directly reference what the file contains (e.g. Excel files are XLSX, Word is DOCX, SAS programs are SAS etc.).  .D is a file association with the D programming language, so naming your file this way can cause unusual effects.  If this is a text based delimited file, then .txt or .dlm or .csv (for comma delimited) should be used.

 

For a SAS based solution (note, updated this slightly from the SAS support provided code from post:

https://communities.sas.com/t5/General-SAS-Programming/find-carriage-return-in-dataset/m-p/177855/hi...

Not tested it, but should work fine.

 

data _null_;
     infile "s:\temp\rob\x.csv" recfm=n;
     file "s:\temp\rob\y.csv" recfm=n;
     retain flag 0;
     input a $char1.;
     if a = '"' then flag=ifn(flag=0,1,0);
     if not(flag=1 and a in ('0D'x,'0A'x)) then put a $char1.;
run;

This reads in the file bit by bit, and drops carriage return characters from the the file.

 

hinear
Fluorite | Level 6

Unfortunately the file that I download from the Progress system is a .d file

 

I cannot change it to .csv or .txt  >..<

Kurt_Bremser
Super User

After downloading, you can rename the file on your computer to any extension you want, including .this_is_the_silliest_extension_I_could_come_up_with.


@hinear wrote:

Unfortunately the file that I download from the Progress system is a .d file

 

I cannot change it to .csv or .txt  >..<


 

andreas_lds
Jade | Level 19
data test(keep= Var1 Var2);
   length 
      Var1 $ 4 Var2 $ 999
      buffer $ 32000
      rx 8
   ;

   retain 
      buffer ' ' 
      rx 
   ;

   if _n_ = 1 then do;
      rx = prxparse('/"(.+)".*"(.+)"/');
   end;
      
   infile "&ImportLocation.test.d" delimiter=' ' MISSOVER DSD lrecl=32767; 
   input;

   if lengthn(_infile_) = 0 then delete;

   if countc(_infile_, '"') = 3 then do;
      buffer = _infile_;
   end;
   else do;
      buffer = cats(buffer, _infile_);
   end;

   if mod(countc(buffer, '"'), 2) = 0 then do;
      if prxmatch(rx, buffer) then do;
         Var1 = prxposn(rx, 1, buffer);
         Var2 = prxposn(rx, 2, buffer);
         output;
      end;
   end;

run;

just another idea ....

hinear
Fluorite | Level 6

Thank you very much, it is working. However, my data also contain some variables(number and date), which does not have open and close quotation mark.

 

For example: Content of the data (test.d)

"Row1" 123 24/09/2019 "A AAAAAAAAAAAAAAAAAAAAAAAAA

BBBBBBBBBBBC

CCCCCCCCCCCCCCCC"

 

"Row2" 234 24/09/2019 "A AAAAAAAAAAAAAAAAAAAAAAAAA

BBBBBBBBBBBC

CCCCCCCCCCCCCCCC"

 

How to apply the same logic on it?

Tom
Super User Tom
Super User

You should look at the file and see what it is actually using as the delimiter and as end of line markers.  You can easily use SAS to do this if you do not have a text editor that can do it for you.  For example to list the 1,000 bytes of your file you could use code like this.

data _null_;
   infile 'myfile.d' lrecl=100 recfm=f obs=10;
   input;
   list;
run;

If you are lucky your file will have <CR><LF> at the end of a record and just <CR> or just <LF> as the splits in the middle of the long lines. For example like this:

1213  data _null_;
1214    infile test lrecl=50 recfm=f obs=10;
1215    input;
1216    list;
1217  run;

NOTE: The infile TEST is:

      Filename=...
      RECFM=F,LRECL=50,File Size (bytes)=166,
      Last Modified=10Jul2017:10:17:30,
      Create Time=10Jul2017:10:17:30

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9--
1         "Row1" 123 24/09/2019 "A AAAAAAAAAAAAAAAAAAAAAAAAA

2   CHAR  .BBBBBBBBBBBC.CCCCCCCCCCCCCCCC".."Row2" 234 24/09/
    ZONE  04444444444440444444444444444420025673223332332332
    NUMR  D222222222223D33333333333333332DA22F7220234024F09F

3   CHAR  2019 "A AAAAAAAAAAAAAAAAAAAAAAAAA.BBBBBBBBBBBC.CCC
    ZONE  33332242444444444444444444444444404444444444440444
    NUMR  201902101111111111111111111111111D222222222223D333

4   CHAR  CCCCCCCCCCCCC".. 16
    ZONE  4444444444444200
    NUMR  33333333333332DA
NOTE: 4 records were read from the infile TEST.

So you could read it like this.

data want ;
  infile test dsd dlm=' ' truncover termstr=crlf ;
  length var1 $10 var2 var3 8 var4 $1000 ;
  informat var3 ddmmyy.;
  format var3 yymmdd10.;
  input var1-var4 ;
  var4 = translate(var4,'||','0D0A'x);
run;

Capture.PNG

hinear
Fluorite | Level 6

Unfortunately, my data has <CR><LF> in the end of each row.

 

"Row1" 123 24/09/2019 "A AAAAAAAAAAAAAAAAAAAAAAAAA<CR><LF>
<CR><LF>
BBBBBBBBBBBC<CR><LF>
<CR><LF>
CCCCCCCCCCCCCCCC" <CR><LF>

 

"Row2" 234 24/09/2019 "A AAAAAAAAAAAAAAAAAAAAAAAAA<CR><LF>
<CR><LF>
BBBBBBBBBBBC<CR><LF>
<CR><LF>
CCCCCCCCCCCCCCCC" <CR><LF>

 

 

I used your SAS code, but the output look like the following:-

 

var1

var2

var3

var4

Row1

123

2019-09-24

"A

    

BBBBBBBBBB

   
    

CCCCCCCCCC

   
    
    
    

Row2

234

2019-09-24

"A

    

BBBBBBBBBB

   
    

CCCCCCCCCC

   
Tom
Super User Tom
Super User

You will first need to pre-process the file to remove the embedded CRLF characters.

The simple quote counting technique that @RW9 posted on this thread should do the trick for this file since the strings with the embedded CRLF are quoted.

Then you can read the fixed file using normal delimited file input syntax.

hinear
Fluorite | Level 6

 

 

I finally contact sas support and they give a similar solution.

the method is replace the carriage return and linefeed characters inside the double quotation mark by '@' and '$'

 

http://support.sas.com/kb/26/065.html

 

 

Thank you for all of your help. Otherwise, I don't even notice that there are <CR><LF> inside the file.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Which, you will see from my post above had already been supplied by support and posted above.

Ksharp
Super User

Try option recfm=n .

 

data test;

%let _EFIERR_ = 0;

infile "&ImportLocation.test.d" delimiter=' ' MISSOVER DSD lrecl=3276700 recfm=n ;

informat

 Var1 $ 4.

 Var2 $ 999.;

format

Var1 $ 4.

Var2 $ 999.;

input

 Var1 $

Var2 $ @@;

run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 16 replies
  • 1136 views
  • 6 likes
  • 6 in conversation