BookmarkSubscribeRSS Feed
FJ8
Calcite | Level 5 FJ8
Calcite | Level 5

Hello SAS Community, 

I was trying to import the below file after saving it as .xlsx but kept getting the below error while importing the file to SAS Enterprise Miner 14.1. I have tried the following but did not help: 

1. Tried saving to older version of excel, .xls. 

2. Copy & paste the data to a new worksheet. 

3. imported the file as .csv but only got 7 observations. 

 

The original csv file was downloaded from Kaggle webpage: https://www.kaggle.com/nikhileswarkomati/suicide-watch  

Could anybody help? Many thanks in advanced for this. 🙂 

 

FJ8_0-1644807164082.png

 

4 REPLIES 4
ballardw
Super User

I would suggest starting over, downloading the file as CSV and then reading that file.

If you open CSV files in Excel and save them you can change contents, sometimes quite significantly.

FJ8
Calcite | Level 5 FJ8
Calcite | Level 5

Thank you for your reply. i have tried that too but did not work neither. 😣

Tom
Super User Tom
Super User

Do NOT open a CSV file with EXCEL.  It will change the contents on you.

 

That file is not data.  From the header line there appear to be only three fields, but the longest lines are over 70 thousands bytes long. Also it looks like the first field is supposed to be a record number or identifier, but some of the lines (see lines 6,000 and 20,000 below) do not start with a digit string followed by a comma.  So most likely there are extra line feeds inserted into some of the fields.

101   data _null_;
102     infile 'c:\downloads\Suicide_Detection.csv.zip' zip member='*' lrecl=1000000 end=eof termstr=lf;
103     input;
104     if _N_ in (1,2,3,6000,20000) or eof then list;
105   run;

NOTE: The infile 'c:\downloads\Suicide_Detection.csv.zip' is:
      Filename=c:\downloads\Suicide_Detection.csv.zip,
      Member Name=*

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         ,text,class 11
2         2,"Ex Wife Threatening SuicideRecently I left my wife for good because she has cheated on me twice a
     101  nd lied to me so much that I have decided to refuse to go back to her. As of a few days ago, she beg
     201  an threatening suicide. I have tirelessly spent these paat few days talking her out of it and she ke
     301  eps hesitating because she wants to believe I'll come back. I know a lot of people will threaten thi
     401  s in order to get their way, but what happens if she really does? What do I do and how am I supposed
     501   to handle her death on my hands? I still love my wife but I cannot deal with getting cheated on aga
     601  in and constantly feeling insecure. I'm worried today may be the day she does it and I hope so much
     701  it doesn't happen.",suicide 727
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
3         3,Am I weird I don't get affected by compliments if it's coming from someone I know irl but I feel r
     101  eally good when internet strangers do it,non-suicide 152
6000      chest pain 10
20000      0
1025022   348110,I still haven't beaten the first boss in Hollow Knight. I've only fought it a few times and I
     101   always die really early in the fight. I'm terrible at this game y'all. :(,non-suicide 186
NOTE: 1025022 records were read from the infile 'c:\downloads\Suicide_Detection.csv.zip'.
      The minimum record length was 0.
      The maximum record length was 72599.

So the first thing to do is fix the embedded linefeeds.  You can use this %replace_crlf() macro.

Then perhaps the data mining tool can read really long strings?  Does it use VIYA or SAS?

filename csv temp;
%replace_crlf('c:\downloads\Suicide_Detection.csv.zip' zip member='*',csv)

If you have to read it with SAS then you will need to read the lines into multiple records of shorter strings.  The maximum length of a character variable in SAS is only 32,767 bytes.  So you cannot read that text into a single field.

 

To tease out the "TYPE" field at the end of those long lines will be a little tricky.

data suicide ;
  infile csv dsd truncover length=ll firstobs=2 termstr=lf lrecl=1000000;
  length recno length index 8 string $500 ;
  input recno @;
  length=ll ;
  n=ceil(length/500);
  do index=1 to n;
    input string $char500. @;
    output;
  end;
run;

data type;
 set suicide;
 where index>=(n-1);
 by recno;
 length type $12 string2 $1000;
 string2=cats(lag(string),string);
 if last.recno;
 type=scan(string2,-1,',');
 keep recno type;
run;

data suicide;
  merge type suicide;
  by recno;
run;

Results:

NOTE: The infile CSV is:
      Filename=C:\Users\ABERNA~1\AppData\Local\Temp\1\SAS Temporary Files\_TD12972_AMRL20L6F1E4992_\#LN00061,
      RECFM=V,LRECL=1000000,
      File Size (bytes)=167696307,
      Last Modified=13Feb2022:23:26:56,
      Create Time=13Feb2022:23:26:23

NOTE: 232074 records were read from the infile CSV.
      The minimum record length was 17.
      The maximum record length was 117797.
NOTE: The data set WORK.SUICIDE has 465570 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.68 seconds
      cpu time            0.54 seconds


104
105   data type;
106    set suicide;
107    where index>=(n-1);
108    by recno;
109    length type $12 string2 $1000;
110    string2=cats(lag(string),string);
111    if last.recno;
112    type=scan(string2,-1,',');
113    keep recno type;
114   run;

NOTE: There were 321711 observations read from the data set WORK.SUICIDE.
      WHERE index>=(n-1);
NOTE: The data set WORK.TYPE has 232074 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.47 seconds
      cpu time            0.45 seconds


115
116   data suicide;
117     merge type suicide;
118     by recno;
119   run;

NOTE: There were 232074 observations read from the data set WORK.TYPE.
NOTE: There were 465570 observations read from the data set WORK.SUICIDE.
NOTE: The data set WORK.SUICIDE has 465570 observations and 6 variables.
NOTE: DATA statement used (Total process time):
      real time           0.61 seconds
      cpu time            0.50 seconds


120
121
122
123   proc freq data=suicide;
124     where index=1;
125     tables type;
126   run;

NOTE: There were 232074 observations read from the data set WORK.SUICIDE.
      WHERE index=1;
NOTE: PROCEDURE FREQ used (Total process time):
      real time           0.40 seconds
      cpu time            0.17 seconds
The FREQ Procedure

                                        Cumulative    Cumulative
type           Frequency     Percent     Frequency      Percent
----------------------------------------------------------------
non-suicide      116037       50.00        116037        50.00
suicide          116037       50.00        232074       100.00

Now you at least have the strings in SAS variables.  How to make any sense of them is up to you.

I will leave as an exercise how to remove the type from the ends of the last one or two values of STRING for each RECNO.

 

FJ8
Calcite | Level 5 FJ8
Calcite | Level 5

Thank you Tom for the sharing. Let me try to clean those rows and see if it works. 🤞

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to choose a machine learning algorithm

Use this tutorial as a handy guide to weigh the pros and cons of these commonly used machine learning algorithms.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 944 views
  • 0 likes
  • 3 in conversation