BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
novinosrin
Tourmaline | Level 20

Proc Import- why guessingrows isn't guessing properly?

 


filename imp '\\Corp\sites\RIB1001\HLSCreditRisk\Paul\20200219_SBO_DiTech_ShellPoint_Transition\test_data\data__shellpoint_test_data\20150708_ClientName_StepInfo.csv';


proc import datafile=imp
        dbms=csv
        out=w
	
        replace;
		mixed=yes;
     getnames=yes;
	guessingrows=max;

run;

All columns including numerics, dates are being read as character? Any thoughts to fix this irritating behavior plz?

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

@novinosrin wrote:

Sir @ballardw  The sample was a copy/paste of view table window from the output generated by proc import. That was basically to show Paige what the values look like. 


To see what the values look like you need to open the file in a text editor and copy some of the lines. Use the Insert Code button to get a window to paste the lines so they are not re-formatted.

 

Here is code you can use to find the file names and check what delimiter the file is using.

data files ;
  length fileno 8 memname $32 nliteral $65 dlm $5 fname label path $256 ;
  infile "&path.*.csv" filename=fname ;
  input ;
  if fname ne lag(fname);
  fileno+1;
  path=fname ;
  label=scan(path,-1,'/\');
  memname = scan(label,1,'.');
  nliteral=nliteral(memname);
  loc=indexc(_infile_,'09'x,'|');
  if not loc then loc=indexc(_infile_,',');
  if loc then do;
    dlm=char(_infile_,loc);
    dlm=cats("'",put(dlm,$hex2.),"'x");
  end;
  drop loc;
run;

View solution in original post

11 REPLIES 11
nicobuettner
SAS Employee

Can you provide a data sample of 20150708_ClientName_StepInfo.csv that shows the behavior?

PaigeMiller
Diamond | Level 26

Do the columns have NA or N/A or other character values in them (after the top row which should be read as the column name)?

--
Paige Miller
Tom
Super User Tom
Super User

You asked the software to guess how to define the variables.  If you know what they are why did you ask it to guess?  Just write your own data step to read the data.  You can even use the ugly code that PROC IMPORT generates as a starting point.

novinosrin
Tourmaline | Level 20

Thank you Sir @Tom  and others for the response. Here is what I am trying to accomplish  i.e. basically the objective.

 

I was assinged a task to read various sets of CSV/TXT files in a directory and convert these to SAS datasets. I suppose what they would do with these datasets, I don't know.

 

Each file has a different set of variables and is mutually exclusive. Therefore I envisioned an idea of 

 

/*Reference the directory with a fileref*/
Filename filelist pipe "dir /b /s \\Corp\sites\RIB1001\HLSCreditRisk\Paul\20200219_SBO_DiTech_ShellPoint_Transition\test_data\data__shellpoint_test_data\*.csv";  

/*List all Txt files in the directory*/
data flist;                                        
 infile filelist truncover;
 input filename $1000.;
 dsn=scan(scan(filename,-1,'_'),1,'.');/*Get the suffix label names and use the same as dsn*/
run; 

The using CALL EXECUTE, something like 

 

 

data _null_ ;
 set flist;
 call execute(cat(' proc import datafile= ','"',filename,'"',
 ' ',' dbms=csv out= ',' ',dsn,' ',' replace ;'));
 call execute('   getnames=yes; ');
 call execute('   	guessingrows=max; ');
 call execute(' run; ');
run;

 I thought of creating those independent unrelated datasets and be over with it.

 

 

Now, before attempting the fancy process as a while, I tested the Proc import that was posted at the top(question post) to see if that would read correctly. Yes I did fear the damn guesses that proc import does may not be accurate, however I was hoping an 80/20 (accurate/inaccurate)  can be deemed acceptable, albeit  all are read as chars to my disappointment

 

Sir @PaigeMiller   and @nicobuettner   the data looks something like this and I can't spot an N/A yet

Date   ID .............and others

 07/07/2015 0773045799 0.04125 1 02/01/2016 4.375
07/07/2015 0773045797 0.02125 0
07/07/2015 0773045794 0.05 0
07/07/2015 0773045791 0.03 0
07/07/2015 0773045788 0.05 0
07/07/2015 0773045787 0.02 1 12/01/2016 3 12/01/2017 4 12/01/2018 4.25
07/07/2015 0773045784 0.045 1 11/01/2015 5
07/07/2015 0773045779 0.025 0
07/07/2015 0773045776 0.02875 0
07/07/2015 0773045775 0.02875 0
07/07/2015 0773045773 0.02375 1 08/01/2016 3.375 08/01/2017 4.375 08/01/2018 4.625 

 

I am not in a position to write a custom code considering the numerous files potentially the directory may contain in the months to come though numerous shouldn't be over 100 I would think, nonetheless 100+ different datastep code to read, and the fact going into each file to learn about the data values pattern would make it extremely time consuming. 

 

Is my thinking reasonable?

 

 

PaigeMiller
Diamond | Level 26

The slashes cause PROC IMPORT to consider the entry to be character.

 

If you run PROC IMPORT (and get the wrong results, as we know), code is written to the log, and you can change the code so that instead of reading the columns as character, you instruct it to create a numeric variable by applying the proper informat.

--
Paige Miller
Tom
Super User Tom
Super User

That data is NOT comma delimited.  That is why PROC IMPORT is doing strange things with it.

 

Are those spaces or tabs between the fields? It is hard to tell after you have pasted the data into the forum. 

 

Are all of the files using tabs as delimiters?  Or do you need to do more work to figure out for each file what delimiter it is using?

ballardw
Super User

@novinosrin wrote:

Thank you Sir @Tom  and others for the response. Here is what I am trying to accomplish  i.e. basically the objective.

 

I was assinged a task to read various sets of CSV/TXT files in a directory and convert these to SAS datasets. I suppose what they would do with these datasets, I don't know.

 

Each file has a different set of variables and is mutually exclusive. Therefore I envisioned an idea of 

 

/*Reference the directory with a fileref*/
Filename filelist pipe "dir /b /s \\Corp\sites\RIB1001\HLSCreditRisk\Paul\20200219_SBO_DiTech_ShellPoint_Transition\test_data\data__shellpoint_test_data\*.csv";  

/*List all Txt files in the directory*/
data flist;                                        
 infile filelist truncover;
 input filename $1000.;
 dsn=scan(scan(filename,-1,'_'),1,'.');/*Get the suffix label names and use the same as dsn*/
run; 

The using CALL EXECUTE, something like 

 

 

data _null_ ;
 set flist;
 call execute(cat(' proc import datafile= ','"',filename,'"',
 ' ',' dbms=csv out= ',' ',dsn,' ',' replace ;'));
 call execute('   getnames=yes; ');
 call execute('   	guessingrows=max; ');
 call execute(' run; ');
run;

 I thought of creating those independent unrelated datasets and be over with it.

 

 

Now, before attempting the fancy process as a while, I tested the Proc import that was posted at the top(question post) to see if that would read correctly. Yes I did fear the damn guesses that proc import does may not be accurate, however I was hoping an 80/20 (accurate/inaccurate)  can be deemed acceptable, albeit  all are read as chars to my disappointment

 

Sir @PaigeMiller   and @nicobuettner   the data looks something like this and I can't spot an N/A yet

Date   ID .............and others

 07/07/2015 0773045799 0.04125 1 02/01/2016 4.375
07/07/2015 0773045797 0.02125 0
07/07/2015 0773045794 0.05 0
07/07/2015 0773045791 0.03 0
07/07/2015 0773045788 0.05 0
07/07/2015 0773045787 0.02 1 12/01/2016 3 12/01/2017 4 12/01/2018 4.25
07/07/2015 0773045784 0.045 1 11/01/2015 5
07/07/2015 0773045779 0.025 0
07/07/2015 0773045776 0.02875 0
07/07/2015 0773045775 0.02875 0
07/07/2015 0773045773 0.02375 1 08/01/2016 3.375 08/01/2017 4.375 08/01/2018 4.625 

 

I am not in a position to write a custom code considering the numerous files potentially the directory may contain in the months to come though numerous shouldn't be over 100 I would think, nonetheless 100+ different datastep code to read, and the fact going into each file to learn about the data values pattern would make it extremely time consuming. 

 

Is my thinking reasonable?

 

 


That isn't CSV but apparently space delimited though possibly TAB delimited as the example wasn't posted in a text box so may have been reformatted by the message windows. I would be tempted to try DBMS=DLM with Delimiter=' ' to see if anything changes. Or DBMS=TAB

 

I would also use PROC Printto to redirect the LOG of all those Proc Import steps into a text file so you have something that you can modify when the case comes up that variable ABC needs to be the same type/length in different data sets

novinosrin
Tourmaline | Level 20

Sir @ballardw  The sample was a copy/paste of view table window from the output generated by proc import. That was basically to show Paige what the values look like. 

Tom
Super User Tom
Super User

@novinosrin wrote:

Sir @ballardw  The sample was a copy/paste of view table window from the output generated by proc import. That was basically to show Paige what the values look like. 


To see what the values look like you need to open the file in a text editor and copy some of the lines. Use the Insert Code button to get a window to paste the lines so they are not re-formatted.

 

Here is code you can use to find the file names and check what delimiter the file is using.

data files ;
  length fileno 8 memname $32 nliteral $65 dlm $5 fname label path $256 ;
  infile "&path.*.csv" filename=fname ;
  input ;
  if fname ne lag(fname);
  fileno+1;
  path=fname ;
  label=scan(path,-1,'/\');
  memname = scan(label,1,'.');
  nliteral=nliteral(memname);
  loc=indexc(_infile_,'09'x,'|');
  if not loc then loc=indexc(_infile_,',');
  if loc then do;
    dlm=char(_infile_,loc);
    dlm=cats("'",put(dlm,$hex2.),"'x");
  end;
  drop loc;
run;
novinosrin
Tourmaline | Level 20

Sir @Tom  wow wow wow. Brilliance personified showing the elegant thinking. 

 

That gives me enough lead and direction to proceed. Cheers! got to have a pint for that. Thank you so much*1E6

Kurt_Bremser
Super User

If you have N files where the structure is not known and has to be guessed, you have N serious problems to fix. And if the structure for any given file is not available, how do you guess what to do with it anyway?

IMO, you are on the way to slowly torturing yourself into insanity. Tell your colleagues to run as soon as they see foam on your lips.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 2305 views
  • 6 likes
  • 6 in conversation