Hi,
How can I import a CSV file (attached) where one column contains strings of around 2,000 characters with special characters (hidden quotes, double quotes, embedded line breaks)?
Cust_id | trans_type | gender | trans_date | trans_amt | Comments |
11201 | Entertainment | F | ######## | $50.65 | I Talked to customer, but he is not interested |
11205 | Grocery | m | ######## | $90.00 | 1. he is not wiling but he is will continue in future 2. he is good |
11206 | Travel | m | ######## | $23.00 | "not wiiling to continue" regards Diip |
11206 | Travel | m | ######## | $23.00 | "not talked @@ regards hiip |
Yuck.
This is one of the few cases where I'd say keep the data in Excel, since Excel will keep the relevant content in a single cell.
Note this assumes you started with an Excel file, since going to Excel from a CSV will hit the same issues.
Unfortunately we don't have access to import Excel files; we need to convert to either CSV or text files.
This is just sample data. I have unstructured data with a lot of comments, and we can't control that column. I need to load this data into a dataset. Do you have any recommendations for how to import it?
Hi,
I created the attached txt file, simulating an export from Excel to CSV.
I imported it with this code:
data test;
  infile "\\UNC path to my file\test.txt" delimiter="09"x termstr="|" dsd;
  length col1 col2 $1000;
  input col1 col2;
run;

proc print;
run;
If your data has embedded CRLF, then use termstr to change the character that SAS thinks marks the end-of-record.
Note: usually Excel doesn't use CRLF to embed line feeds, but only LF (I think it's "0A"x). So, you may not have to use termstr= or the pipe symbol to mark your end-of-record.
I have a Powershell script which runs a query against an Excel file and returns it as a CSV stream. You can use an infile pipe to dynamically read the stream in a data step.
You could modify it to write out your custom termstr character at the end of each line (something like SELECT [*,"|"] FROM [Sheet1$]).
Or, use another technique, such as adding the end-of-record character in your Excel file itself.
If you want to try the Powershell script, I don't have time to support further questions, but Google and the SAS documentation are your friends.
Here's the link: https://github.com/scottbass/Powershell/blob/master/Scripts/Query-Excel.ps1
Hope this helps...
You are in luck. The termstr is CRLF; the other line breaks are CR (I checked with Notepad++).

filename x '/folders/myfolders/comments.csv' termstr=crlf;
proc import datafile=x out=want dbms=csv replace;
  guessingrows=32767;
run;
I don't see any attached file.
Why do you say that you cannot read the Excel file? SAS can directly read both XLS and XLSX files.
libname mydata xlsx 'myfile.xlsx';
proc copy inlib=mydata outlib=work;
run;
If you have a CSV file generated by Excel then you can probably read it into SAS by making sure to tell SAS to use CRLF as the end of line markers. That way it will read the embedded CR that Excel normally uses for line breaks inside of cells.
data want;
  infile 'myfile.csv' dsd truncover lrecl=100000 firstobs=2 termstr=crlf;
  length col1-col6 $32767;
  input col1-col6;
run;
I have successfully imported the file. Now I want to split the text into individual words. The code below works fine for a small dataset,
but my dataset is large and it is taking too much time. Does anyone have suggestions to improve the code? Thanks
data have1;
  set have(keep=comments);
  count = countw(comments," ");
run;

data have2(drop=count i);
  set have1;
  do i=1 to count;
    words = scan(comments,i," ");
    output;
  end;
run;
You can combine these two data steps into one:

data have2(drop=i);
  set have;
  do i=1 to countw(comments," ");
    words = scan(comments,i," ");
    output;
  end;
run;
This is not reducing my run time.
Could we write a similar program using PRX functions? Would that help the run time?
Can you define "takes too much time"?
Is this a process that needs to be repeated every week? Month?
The dataset has around 500k comments. I want to split the comments into separate words, then count how many times each word appeared. The code above is taking too much time and generating around 100 GB of data, so I want to reduce both the run time and the space.
After some research I realized I initially assigned a length of $5000 to the comments variable so that all comments would load properly into my dataset; I think that is why my dataset is around 100 GB.
Below is the output I am getting after running the code. As you can see, every word row repeats the full comment. Is there any way to improve the code? Or how can I find the maximum length of the variable before loading, so I can assign just that length? Or can I get the same results another way? I hope you understand the issue. Thanks
comments | words |
I Talked to customer | I |
I Talked to customer | Talked |
I Talked to customer | to |
I Talked to customer | customer |
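On the max-length question, here is a minimal sketch (assuming your dataset is named have and the variable is comments): measure the longest value first, then use that number as the length on a second pass.

proc sql noprint;
  /* find the longest non-blank comment */
  select max(lengthn(comments)) into :maxlen trimmed
  from have;
quit;
%put Longest comment is &maxlen characters;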
Have you considered SAS text analytics rather than using Base SAS?
Don't have a license for SAS text analytics 😞
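In Base SAS, a sketch of the word-frequency count (assuming the dataset is have and the variable is comments). Dropping comments before output keeps the intermediate dataset small, and a modest length for words avoids carrying the $5000 comment length on every row; let PROC FREQ do the counting:

data words(keep=words);
  set have;
  length words $50;          /* long enough for a single word, not a whole comment */
  do i = 1 to countw(comments, " ");
    words = scan(comments, i, " ");
    output;                  /* one row per word, comments variable is dropped */
  end;
run;

proc freq data=words order=freq noprint;
  tables words / out=wordcounts(drop=percent);
run;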