BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
LineMoon
Lapis Lazuli | Level 10

 

Hello experts,

 

I want to import a files (.txt, .xls, xlsx) into sas 9.2 without taking care about the "delimiter" ,  "textsize" or guessingrow more than  32767

I do not want to explore file, the program will be avalaible for all files( .txt,xls, xlsx)

 

I want to do this :

 

Step 1

 

%let file=V1/How_ever_theFile.txt   / * or .xls, .xlsx  */

%import(file=&file);

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

I am not big fan of PROC IMPORT because it has to guess about what your varaibles are. I find that I can write the data step to read the file almost faster than I can write the code to run PROC IMPORT.  And then I know what it is doing.

 

PROC IMPORT will NOT guess the delimiter for you.  You need to tell it what delimiter to use.

 

It is not hard for a human to find the delimiter.  If you have command prompt just type out the first few lines and look at them.  If you can't do that then just ask SAS to do it.

 

data _null_;
  infile "&file" obs=5 ;
  input;
  list;
run;

If you want to make a program do that then I guess you could try testing a set of common delimiters and see if they make sense. Probably the one that finds the most words is the right one.

data _null_;
  infile "&file" obs=1;
  input ;
  do dlm=',','09'x,'|',';',':' ;
    nwords = countw(_infile_,dlm,'Q');
    put dlm= $hex2. ' ' nwords=;
  end;
run;

View solution in original post

19 REPLIES 19
Reeza
Super User

Write a macro,that will handle all your cases. 

LineMoon
Lapis Lazuli | Level 10

Yes, with sas documentation, we can program the cases possible.

But how can  we identify the "delimiter" without opening the file ? and making more performance guessingrow , even geater than 32767 ?. I know that, I can imoprt automatcally, then use log file , to identify the "delimiter", but I do not want to use this method, it can be very hard

 

Reeza
Super User

AFAIk there isn't a way to detect the delimiter that's going to fall into the 'easy' category. 

 

You can use input/infile to read an entire row regardless of delimiter and try and parse it out. 

 

There res are entire courses on how to read data in R/Python. I'm not sure why you're trying to do this, it seems unrealistic and you're not going to gain much. In fact the time spent developing this will probably never be gained back. 

 

Anyways, good luck with your macro. 

LineMoon
Lapis Lazuli | Level 10

Thank you for your answer, reeza.

Really ? t is very realistic by using sas 9.2, if you have so many files from differents sources, with a differents "delimiter" and you must have to use only sas, the output of the import, will be used in so many sas applications.. we can use "sas log" or checking the external files one by one...but this action will take much time ...that's why , an automatic solution can be better ..

Reeza
Super User

I'm think of it this way - Proc Import has been developed by programmers dedicated to this task. They've decided that providing the minimum of a delimiter and file type will provide the best result. It still usually results in issues, there are questions on here that illustrate this. 

 

I dont doubt that things can be improved - they always can. I just question the use of time. If my staff had time to work in this they don't have enough work 😉

 

Dealing with the different file types xlsx, Xls, and txt files is worth it. Determining the delimiter doesn't seem worth it. My opinion and you can obviously spend time on whatever you're interested in. You can't program your way out of bad processes. 

 

Here's a line from a real text file, The number of genres at the end can vary. 

This is the Movielens dataset if your interested in a file to test your process with. 

 

12::Dracula: Dead and Loving It (1995)::Comedy|Horror

 

 

 

 

 

Reeza
Super User

Also, 9.2 is pretty old, but a I don't think there's been significant changes to dealing with text files, mostly Excel files. 

Tom
Super User Tom
Super User

What have you tried?  How did it not meet your needs?

 

LineMoon
Lapis Lazuli | Level 10

Hello TOM

Thank you for your interst

What have you tried?  

 

I have tried two methods:

First one:

To identify the "delimiter" in the file , and running the proc import

 

the second one: running the import, and using the "the log file" to identify the "delimiter"

 

How did it not meet your needs?

using the last methods for a so many  and many files ...il will take a much time.

That's why , I am looking for an automatic and quick program

I

Reeza
Super User

Post the code you've tried and your logic for determining a delimiter. 

Tom
Super User Tom
Super User

I am not big fan of PROC IMPORT because it has to guess about what your varaibles are. I find that I can write the data step to read the file almost faster than I can write the code to run PROC IMPORT.  And then I know what it is doing.

 

PROC IMPORT will NOT guess the delimiter for you.  You need to tell it what delimiter to use.

 

It is not hard for a human to find the delimiter.  If you have command prompt just type out the first few lines and look at them.  If you can't do that then just ask SAS to do it.

 

data _null_;
  infile "&file" obs=5 ;
  input;
  list;
run;

If you want to make a program do that then I guess you could try testing a set of common delimiters and see if they make sense. Probably the one that finds the most words is the right one.

data _null_;
  infile "&file" obs=1;
  input ;
  do dlm=',','09'x,'|',';',':' ;
    nwords = countw(_infile_,dlm,'Q');
    put dlm= $hex2. ' ' nwords=;
  end;
run;
LineMoon
Lapis Lazuli | Level 10

@Tom thank you  very much for your answer, it's  a geat answer.

You said "I find that I can write the data step to read the file almost faster than I can write the code to run PROC IMPORT"

I am not cloded on "proc import", I am open for an other solution, please can you give more detail  about your solution? 

In your solution, you do not need to explore the external file( vaiables: type, length ) ?

 

Thank you again

Tom
Super User Tom
Super User

If you have no idea what is in your data then let PROC IMPORT try to figure it out.

 

But the problem with that is that if you are receiving multiple versions of data in the same format than PROC IMPORT will make different decisions about lengths of character variables (or even the whether the variable is numeric or character) based on what records happen to be in the version it is currently looking at. Then when you try to combine the datasets later you get problems. 

 

It is better to know what is in your data and write the data step to read it.  Then the structure will not change.

 

If you did want to implement your own code that might apply different rules for guessing what is in the data then start by reading the data into a simple structure and analyzing what you find.  You could just read everything into character variables. So this will read in upto 100 columns into character variables.

 

data raw ;
  infile "&file" dsd truncover lrecl=100000 ;
  length x1-x100 $500 ;
  input x1-x100;
run;

Or you could read it into an even simplier structure that has ROW, COL and VALUE.

data raw ;
  infile "&file" dsd truncover lrecl=100000 column=ccol length=llen;
  length row col 8 value $500 ;
  row+1;
  do col=1 by 1 until (ccol > llen);
    input value @ ;
    output;
  end;
run;

 

LineMoon
Lapis Lazuli | Level 10

@Tom: Thank you very much, that's very kind from you.

as you said "start by reading the data into a simple structure and analyzing what you find"

The problem is that, a visual analysis for what I find, it will take much time, if I have more and more  of files.

An other attention case, we can find: if the vaiable(V) is mixed : charcter and numeric.

V

1

2

..

6  /* line 32767*/

A1 /* line 32768*/

A2 /* line 32769*/

..

8  /* line 120000*/

 

 

Reeza
Super User

That's variable type. Determined the delimiter should be straightforward visually. 

 

Problem is a processor or will make the same mistake as you've illustrated. I just had a 5 column file where it wasn't reading in all of the text. 

 

Also, hoe do you plan to analyze information your unfamiliar with?

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 19 replies
  • 2022 views
  • 6 likes
  • 3 in conversation