Solved: Re: Proc Import- why guessingrows isn't guessing properly?

novinosrin · Posted 02-28-2020 08:54 AM

Proc Import- why guessingrows isn't guessing properly?


filename imp '\\Corp\sites\RIB1001\HLSCreditRisk\Paul\20200219_SBO_DiTech_ShellPoint_Transition\test_data\data__shellpoint_test_data\20150708_ClientName_StepInfo.csv';


proc import datafile=imp
        dbms=csv
        out=w
	
        replace;
		mixed=yes;
     getnames=yes;
	guessingrows=max;

run;

All columns including numerics, dates are being read as character? Any thoughts to fix this irritating behavior plz?

Tom · Posted 02-28-2020 10:59 AM

@novinosrin wrote:

Sir @ballardw The sample was a copy/paste of view table window from the output generated by proc import. That was basically to show Paige what the values look like.

To see what the values look like you need to open the file in a text editor and copy some of the lines. Use the Insert Code button to get a window to paste the lines so they are not re-formatted.

Here is code you can use to find the file names and check what delimiter the file is using.

data files ;
  length fileno 8 memname $32 nliteral $65 dlm $5 fname label path $256 ;
  infile "&path.*.csv" filename=fname ;
  input ;
  if fname ne lag(fname);
  fileno+1;
  path=fname ;
  label=scan(path,-1,'/\');
  memname = scan(label,1,'.');
  nliteral=nliteral(memname);
  loc=indexc(_infile_,'09'x,'|');
  if not loc then loc=indexc(_infile_,',');
  if loc then do;
    dlm=char(_infile_,loc);
    dlm=cats("'",put(dlm,$hex2.),"'x");
  end;
  drop loc;
run;

View solution in original post

nicobuettner · Posted 02-28-2020 08:56 AM

Can you provide a data sample of 20150708_ClientName_StepInfo.csv that shows the behavior?

PaigeMiller · Posted 02-28-2020 08:57 AM

Do the columns have NA or N/A or other character values in them (after the top row which should be read as the column name)?

--
Paige Miller

Tom · Posted 02-28-2020 08:59 AM

You asked the software to guess how to define the variables. If you know what they are why did you ask it to guess? Just write your own data step to read the data. You can even use the ugly code that PROC IMPORT generates as a starting point.

novinosrin · Posted 02-28-2020 10:00 AM

Thank you Sir @Tom and others for the response. Here is what I am trying to accomplish i.e. basically the objective.

I was assinged a task to read various sets of CSV/TXT files in a directory and convert these to SAS datasets. I suppose what they would do with these datasets, I don't know.

Each file has a different set of variables and is mutually exclusive. Therefore I envisioned an idea of

/*Reference the directory with a fileref*/
Filename filelist pipe "dir /b /s \\Corp\sites\RIB1001\HLSCreditRisk\Paul\20200219_SBO_DiTech_ShellPoint_Transition\test_data\data__shellpoint_test_data\*.csv";  

/*List all Txt files in the directory*/
data flist;                                        
 infile filelist truncover;
 input filename $1000.;
 dsn=scan(scan(filename,-1,'_'),1,'.');/*Get the suffix label names and use the same as dsn*/
run;

The using CALL EXECUTE, something like

data _null_ ;
 set flist;
 call execute(cat(' proc import datafile= ','"',filename,'"',
 ' ',' dbms=csv out= ',' ',dsn,' ',' replace ;'));
 call execute('   getnames=yes; ');
 call execute('   	guessingrows=max; ');
 call execute(' run; ');
run;

I thought of creating those independent unrelated datasets and be over with it.

Now, before attempting the fancy process as a while, I tested the Proc import that was posted at the top(question post) to see if that would read correctly. Yes I did fear the damn guesses that proc import does may not be accurate, however I was hoping an 80/20 (accurate/inaccurate) can be deemed acceptable, albeit all are read as chars to my disappointment

Sir @PaigeMiller and @nicobuettner the data looks something like this and I can't spot an N/A yet

Date ID .............and others

07/07/2015 0773045799 0.04125 1 02/01/2016 4.375
07/07/2015 0773045797 0.02125 0
07/07/2015 0773045794 0.05 0
07/07/2015 0773045791 0.03 0
07/07/2015 0773045788 0.05 0
07/07/2015 0773045787 0.02 1 12/01/2016 3 12/01/2017 4 12/01/2018 4.25
07/07/2015 0773045784 0.045 1 11/01/2015 5
07/07/2015 0773045779 0.025 0
07/07/2015 0773045776 0.02875 0
07/07/2015 0773045775 0.02875 0
07/07/2015 0773045773 0.02375 1 08/01/2016 3.375 08/01/2017 4.375 08/01/2018 4.625

I am not in a position to write a custom code considering the numerous files potentially the directory may contain in the months to come though numerous shouldn't be over 100 I would think, nonetheless 100+ different datastep code to read, and the fact going into each file to learn about the data values pattern would make it extremely time consuming.

Is my thinking reasonable?

PaigeMiller · Posted 02-28-2020 10:02 AM

The slashes cause PROC IMPORT to consider the entry to be character.

If you run PROC IMPORT (and get the wrong results, as we know), code is written to the log, and you can change the code so that instead of reading the columns as character, you instruct it to create a numeric variable by applying the proper informat.

--
Paige Miller

Tom · Posted 02-28-2020 10:27 AM

That data is NOT comma delimited. That is why PROC IMPORT is doing strange things with it.

Are those spaces or tabs between the fields? It is hard to tell after you have pasted the data into the forum.

Are all of the files using tabs as delimiters? Or do you need to do more work to figure out for each file what delimiter it is using?

ballardw · Posted 02-28-2020 10:34 AM

@novinosrin wrote:

Thank you Sir @Tom and others for the response. Here is what I am trying to accomplish i.e. basically the objective.

I was assinged a task to read various sets of CSV/TXT files in a directory and convert these to SAS datasets. I suppose what they would do with these datasets, I don't know.

Each file has a different set of variables and is mutually exclusive. Therefore I envisioned an idea of
/*Reference the directory with a fileref*/
Filename filelist pipe "dir /b /s \\Corp\sites\RIB1001\HLSCreditRisk\Paul\20200219_SBO_DiTech_ShellPoint_Transition\test_data\data__shellpoint_test_data\*.csv";  

/*List all Txt files in the directory*/
data flist;                                        
 infile filelist truncover;
 input filename $1000.;
 dsn=scan(scan(filename,-1,'_'),1,'.');/*Get the suffix label names and use the same as dsn*/
run; 
The using CALL EXECUTE, something like
data _null_ ;
 set flist;
 call execute(cat(' proc import datafile= ','"',filename,'"',
 ' ',' dbms=csv out= ',' ',dsn,' ',' replace ;'));
 call execute('   getnames=yes; ');
 call execute('   	guessingrows=max; ');
 call execute(' run; ');
run;
I thought of creating those independent unrelated datasets and be over with it.

Now, before attempting the fancy process as a while, I tested the Proc import that was posted at the top(question post) to see if that would read correctly. Yes I did fear the damn guesses that proc import does may not be accurate, however I was hoping an 80/20 (accurate/inaccurate) can be deemed acceptable, albeit all are read as chars to my disappointment

Sir @PaigeMiller and @nicobuettner the data looks something like this and I can't spot an N/A yet

Date ID .............and others

07/07/2015 0773045799 0.04125 1 02/01/2016 4.375
07/07/2015 0773045797 0.02125 0
07/07/2015 0773045794 0.05 0
07/07/2015 0773045791 0.03 0
07/07/2015 0773045788 0.05 0
07/07/2015 0773045787 0.02 1 12/01/2016 3 12/01/2017 4 12/01/2018 4.25
07/07/2015 0773045784 0.045 1 11/01/2015 5
07/07/2015 0773045779 0.025 0
07/07/2015 0773045776 0.02875 0
07/07/2015 0773045775 0.02875 0
07/07/2015 0773045773 0.02375 1 08/01/2016 3.375 08/01/2017 4.375 08/01/2018 4.625

I am not in a position to write a custom code considering the numerous files potentially the directory may contain in the months to come though numerous shouldn't be over 100 I would think, nonetheless 100+ different datastep code to read, and the fact going into each file to learn about the data values pattern would make it extremely time consuming.

Is my thinking reasonable?

That isn't CSV but apparently space delimited though possibly TAB delimited as the example wasn't posted in a text box so may have been reformatted by the message windows. I would be tempted to try DBMS=DLM with Delimiter=' ' to see if anything changes. Or DBMS=TAB

I would also use PROC Printto to redirect the LOG of all those Proc Import steps into a text file so you have something that you can modify when the case comes up that variable ABC needs to be the same type/length in different data sets

novinosrin · Posted 02-28-2020 10:47 AM

Sir @ballardw The sample was a copy/paste of view table window from the output generated by proc import. That was basically to show Paige what the values look like.

Tom · Posted 02-28-2020 10:59 AM

@novinosrin wrote:

Sir @ballardw The sample was a copy/paste of view table window from the output generated by proc import. That was basically to show Paige what the values look like.

To see what the values look like you need to open the file in a text editor and copy some of the lines. Use the Insert Code button to get a window to paste the lines so they are not re-formatted.

Here is code you can use to find the file names and check what delimiter the file is using.

data files ;
  length fileno 8 memname $32 nliteral $65 dlm $5 fname label path $256 ;
  infile "&path.*.csv" filename=fname ;
  input ;
  if fname ne lag(fname);
  fileno+1;
  path=fname ;
  label=scan(path,-1,'/\');
  memname = scan(label,1,'.');
  nliteral=nliteral(memname);
  loc=indexc(_infile_,'09'x,'|');
  if not loc then loc=indexc(_infile_,',');
  if loc then do;
    dlm=char(_infile_,loc);
    dlm=cats("'",put(dlm,$hex2.),"'x");
  end;
  drop loc;
run;

novinosrin · Posted 02-28-2020 11:24 AM

Sir @Tom wow wow wow. Brilliance personified showing the elegant thinking.

That gives me enough lead and direction to proceed. Cheers! got to have a pint for that. Thank you so much*1E6

Kurt_Bremser · Posted 02-28-2020 10:46 AM

If you have N files where the structure is not known and has to be guessed, you have N serious problems to fix. And if the structure for any given file is not available, how do you guess what to do with it anyway?

IMO, you are on the way to slowly torturing yourself into insanity. Tell your colleagues to run as soon as they see foam on your lips.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!