BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
AlexMoreton
Obsidian | Level 7

Hi,

I have a dataset with one variable that needs to be cleansed (it contains a unrecognized character). However, the variable itself is over 50K characters long... Therefore I need to split the variable down before i can completely clean it (I have successfully called it onto xml but it was only 17K long). I have found macros and codes that split them into columns. But ideally I would like to break them into chunks/files and cleanse them and then merge them up again (PROC SQL UNION seems to be the way for it).

Can anyone help me on this. I would really appreciate pointers or articles that might help me out.

Thank you.

This is the code I used to just look into the length of the variable. Ideally I want to split it 20k per file and then cleanse each file. 

 

Create table tmp as

       select * from connection to sqlserver

              (Select

                 substring(cast(bigfile_xml as nvarchar(max)),17000,17000) as XMLCol

              From

                     bigfile

           Where id = 1;

              );

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Why not just take it all at once using multiple variables?

create table tmp as
 select * from connection to sqlserver
  (select
      substring(cast(bigfile_xml as nvarchar(max)),1,17000) as XMLCol1
    , substring(cast(bigfile_xml as nvarchar(max)),17001,17000) as XMLCol2
    , substring(cast(bigfile_xml as nvarchar(max)),34001,17000) as XMLCol3
...
   from
      bigfile
  where id = 1;
   )
;

View solution in original post

7 REPLIES 7
RW9
Diamond | Level 26 RW9
Diamond | Level 26

??   Variables in SAS cannot be longer than 2000 characters.  Try starting by posting some example test data, in the form of a datastep of what you actually have.  Is it an XML file you are trying to read into SAS?  If so why not import into data direct from XML, then post process the data.  really can't follow what you trying to do.

AlexMoreton
Obsidian | Level 7

RW9  

The file I was given is a dataset with a string that has over 50K characters. I already created a map file to map the string out and allocate the characters to different variables. However the string is too large. I want to export the file into XML but whenever I tried exporting it I end up with only 17K characters.

Therefore I need to basically break it down into smaller chunks before I export it (unless my logic is incorrect. If so please point it out to me, thank you).

 

 

let BigfileC= CountC(Bigfile);              *CountC can be used to count characters instead of words.

let chunks = round(BigfileC/32000);   *i'm trying to round it up so I know how many chunks I need for it (each observation might have                                                                 different lengths)

Do I = 0 to chunks;                              * I'm thinking a DO LOOP would be a possible method for this splitting?

Text(I) = 

Substr(text,i*32000+1,(i+1)*32000) ;   * This is how I thought would help create the files.

Output Text(I)

End;

 

 

I know I might be wasting your time but please be a bit patient as I am just learning this language. Thank you. I really appreciate your help.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Nope, its no bother.  Am just trying to understand your starting point here:

"The file I was given is a dataset"

Is it a SAS dataset - i.e. .sas7bdat extension, as SAS has maximum variable length for strings as 2000.   Or is it another file, maybe a text file with end of line characters?  That would makes sense as the "line" effectively is the whole thing.  Judging by your use of the 32000 it seems you are dealing with a text file, if so you could just read it in one character at a time and concatenate that on the output;

data want;
  file 'abc.txt';
  length line $2000;
  input a $char1.;
  line=cat(line,a);
  if lengthn(line) > 1999 then do;
    output;
    line="";
  end;
run;

This will write out to a datasets lines of 2000 characters.

AlexMoreton
Obsidian | Level 7

RW9, Thanks for the advice!

I got the file from a SQL database and I think it's originally a text file (think that's why the xml).

I'll give it a go and let you know! 

Thank you

 

Tom
Super User Tom
Super User

What is in the these 70K strings?  From the name it looks like it contains some type of XML.  What is in the XML? What is the format?

You might have better luck using XML parsing in your SQL server to pull out the data fields from the XML string instead of just taking substrings that might break in the middle of an XML tag.

 

How many of these xml strings do you have?  How many observations are in your source table?

You could try writing them to a text file(s) and then trying to parse the XML from the file.

AhmedAl_Attar
Rhodochrosite | Level 12

Hi,

From reading your entire thread, it sounds like you have a Database Table with a column/variable of a CLOB data type containing the String you are trying to manipulate and cleans.

 

These two papers may give you some tips with how to handle such data.

View Microsoft Word RTF Objects Embedded In Oracle

- Using SAS® to Report Data in XML Format

 

Hope this helps,

Ahmed

Tom
Super User Tom
Super User

Why not just take it all at once using multiple variables?

create table tmp as
 select * from connection to sqlserver
  (select
      substring(cast(bigfile_xml as nvarchar(max)),1,17000) as XMLCol1
    , substring(cast(bigfile_xml as nvarchar(max)),17001,17000) as XMLCol2
    , substring(cast(bigfile_xml as nvarchar(max)),34001,17000) as XMLCol3
...
   from
      bigfile
  where id = 1;
   )
;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 2691 views
  • 1 like
  • 4 in conversation