Proc Import- why guessingrows isn't guessing properly?
filename imp '\\Corp\sites\RIB1001\HLSCreditRisk\Paul\20200219_SBO_DiTech_ShellPoint_Transition\test_data\data__shellpoint_test_data\20150708_ClientName_StepInfo.csv';
proc import datafile=imp
dbms=csv
out=w
replace;
mixed=yes;
getnames=yes;
guessingrows=max;
run;
All columns including numerics, dates are being read as character? Any thoughts to fix this irritating behavior plz?
@novinosrin wrote:
Sir @ballardw The sample was a copy/paste of view table window from the output generated by proc import. That was basically to show Paige what the values look like.
To see what the values look like you need to open the file in a text editor and copy some of the lines. Use the Insert Code button to get a window to paste the lines so they are not re-formatted.
Here is code you can use to find the file names and check what delimiter the file is using.
data files ;
length fileno 8 memname $32 nliteral $65 dlm $5 fname label path $256 ;
infile "&path.*.csv" filename=fname ;
input ;
if fname ne lag(fname);
fileno+1;
path=fname ;
label=scan(path,-1,'/\');
memname = scan(label,1,'.');
nliteral=nliteral(memname);
loc=indexc(_infile_,'09'x,'|');
if not loc then loc=indexc(_infile_,',');
if loc then do;
dlm=char(_infile_,loc);
dlm=cats("'",put(dlm,$hex2.),"'x");
end;
drop loc;
run;
Can you provide a data sample of 20150708_ClientName_StepInfo.csv that shows the behavior?
Do the columns have NA or N/A or other character values in them (after the top row which should be read as the column name)?
You asked the software to guess how to define the variables. If you know what they are why did you ask it to guess? Just write your own data step to read the data. You can even use the ugly code that PROC IMPORT generates as a starting point.
Thank you Sir @Tom and others for the response. Here is what I am trying to accomplish i.e. basically the objective.
I was assinged a task to read various sets of CSV/TXT files in a directory and convert these to SAS datasets. I suppose what they would do with these datasets, I don't know.
Each file has a different set of variables and is mutually exclusive. Therefore I envisioned an idea of
/*Reference the directory with a fileref*/
Filename filelist pipe "dir /b /s \\Corp\sites\RIB1001\HLSCreditRisk\Paul\20200219_SBO_DiTech_ShellPoint_Transition\test_data\data__shellpoint_test_data\*.csv";
/*List all Txt files in the directory*/
data flist;
infile filelist truncover;
input filename $1000.;
dsn=scan(scan(filename,-1,'_'),1,'.');/*Get the suffix label names and use the same as dsn*/
run;
The using CALL EXECUTE, something like
data _null_ ;
set flist;
call execute(cat(' proc import datafile= ','"',filename,'"',
' ',' dbms=csv out= ',' ',dsn,' ',' replace ;'));
call execute(' getnames=yes; ');
call execute(' guessingrows=max; ');
call execute(' run; ');
run;
I thought of creating those independent unrelated datasets and be over with it.
Now, before attempting the fancy process as a while, I tested the Proc import that was posted at the top(question post) to see if that would read correctly. Yes I did fear the damn guesses that proc import does may not be accurate, however I was hoping an 80/20 (accurate/inaccurate) can be deemed acceptable, albeit all are read as chars to my disappointment
Sir @PaigeMiller and @nicobuettner the data looks something like this and I can't spot an N/A yet
Date ID .............and others
07/07/2015 0773045799 0.04125 1 02/01/2016 4.375
07/07/2015 0773045797 0.02125 0
07/07/2015 0773045794 0.05 0
07/07/2015 0773045791 0.03 0
07/07/2015 0773045788 0.05 0
07/07/2015 0773045787 0.02 1 12/01/2016 3 12/01/2017 4 12/01/2018 4.25
07/07/2015 0773045784 0.045 1 11/01/2015 5
07/07/2015 0773045779 0.025 0
07/07/2015 0773045776 0.02875 0
07/07/2015 0773045775 0.02875 0
07/07/2015 0773045773 0.02375 1 08/01/2016 3.375 08/01/2017 4.375 08/01/2018 4.625
I am not in a position to write a custom code considering the numerous files potentially the directory may contain in the months to come though numerous shouldn't be over 100 I would think, nonetheless 100+ different datastep code to read, and the fact going into each file to learn about the data values pattern would make it extremely time consuming.
Is my thinking reasonable?
The slashes cause PROC IMPORT to consider the entry to be character.
If you run PROC IMPORT (and get the wrong results, as we know), code is written to the log, and you can change the code so that instead of reading the columns as character, you instruct it to create a numeric variable by applying the proper informat.
That data is NOT comma delimited. That is why PROC IMPORT is doing strange things with it.
Are those spaces or tabs between the fields? It is hard to tell after you have pasted the data into the forum.
Are all of the files using tabs as delimiters? Or do you need to do more work to figure out for each file what delimiter it is using?
@novinosrin wrote:
Thank you Sir @Tom and others for the response. Here is what I am trying to accomplish i.e. basically the objective.
I was assinged a task to read various sets of CSV/TXT files in a directory and convert these to SAS datasets. I suppose what they would do with these datasets, I don't know.
Each file has a different set of variables and is mutually exclusive. Therefore I envisioned an idea of
/*Reference the directory with a fileref*/ Filename filelist pipe "dir /b /s \\Corp\sites\RIB1001\HLSCreditRisk\Paul\20200219_SBO_DiTech_ShellPoint_Transition\test_data\data__shellpoint_test_data\*.csv"; /*List all Txt files in the directory*/ data flist; infile filelist truncover; input filename $1000.; dsn=scan(scan(filename,-1,'_'),1,'.');/*Get the suffix label names and use the same as dsn*/ run;
The using CALL EXECUTE, something like
data _null_ ; set flist; call execute(cat(' proc import datafile= ','"',filename,'"', ' ',' dbms=csv out= ',' ',dsn,' ',' replace ;')); call execute(' getnames=yes; '); call execute(' guessingrows=max; '); call execute(' run; '); run;
I thought of creating those independent unrelated datasets and be over with it.
Now, before attempting the fancy process as a while, I tested the Proc import that was posted at the top(question post) to see if that would read correctly. Yes I did fear the damn guesses that proc import does may not be accurate, however I was hoping an 80/20 (accurate/inaccurate) can be deemed acceptable, albeit all are read as chars to my disappointment
Sir @PaigeMiller and @nicobuettner the data looks something like this and I can't spot an N/A yet
Date ID .............and others
07/07/2015 0773045799 0.04125 1 02/01/2016 4.375
07/07/2015 0773045797 0.02125 0
07/07/2015 0773045794 0.05 0
07/07/2015 0773045791 0.03 0
07/07/2015 0773045788 0.05 0
07/07/2015 0773045787 0.02 1 12/01/2016 3 12/01/2017 4 12/01/2018 4.25
07/07/2015 0773045784 0.045 1 11/01/2015 5
07/07/2015 0773045779 0.025 0
07/07/2015 0773045776 0.02875 0
07/07/2015 0773045775 0.02875 0
07/07/2015 0773045773 0.02375 1 08/01/2016 3.375 08/01/2017 4.375 08/01/2018 4.625
I am not in a position to write a custom code considering the numerous files potentially the directory may contain in the months to come though numerous shouldn't be over 100 I would think, nonetheless 100+ different datastep code to read, and the fact going into each file to learn about the data values pattern would make it extremely time consuming.
Is my thinking reasonable?
That isn't CSV but apparently space delimited though possibly TAB delimited as the example wasn't posted in a text box so may have been reformatted by the message windows. I would be tempted to try DBMS=DLM with Delimiter=' ' to see if anything changes. Or DBMS=TAB
I would also use PROC Printto to redirect the LOG of all those Proc Import steps into a text file so you have something that you can modify when the case comes up that variable ABC needs to be the same type/length in different data sets
Sir @ballardw The sample was a copy/paste of view table window from the output generated by proc import. That was basically to show Paige what the values look like.
@novinosrin wrote:
Sir @ballardw The sample was a copy/paste of view table window from the output generated by proc import. That was basically to show Paige what the values look like.
To see what the values look like you need to open the file in a text editor and copy some of the lines. Use the Insert Code button to get a window to paste the lines so they are not re-formatted.
Here is code you can use to find the file names and check what delimiter the file is using.
data files ;
length fileno 8 memname $32 nliteral $65 dlm $5 fname label path $256 ;
infile "&path.*.csv" filename=fname ;
input ;
if fname ne lag(fname);
fileno+1;
path=fname ;
label=scan(path,-1,'/\');
memname = scan(label,1,'.');
nliteral=nliteral(memname);
loc=indexc(_infile_,'09'x,'|');
if not loc then loc=indexc(_infile_,',');
if loc then do;
dlm=char(_infile_,loc);
dlm=cats("'",put(dlm,$hex2.),"'x");
end;
drop loc;
run;
Sir @Tom wow wow wow. Brilliance personified showing the elegant thinking.
That gives me enough lead and direction to proceed. Cheers! got to have a pint for that. Thank you so much*1E6
If you have N files where the structure is not known and has to be guessed, you have N serious problems to fix. And if the structure for any given file is not available, how do you guess what to do with it anyway?
IMO, you are on the way to slowly torturing yourself into insanity. Tell your colleagues to run as soon as they see foam on your lips.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.