DATA Step, Macro, Functions and more

How do I split my data set into multiple files?

Accepted Solution Solved
Reply
Contributor
Posts: 30
Accepted Solution

How do I split my data set into multiple files?

Hi,

I have a dataset with one variable that needs to be cleansed (it contains a unrecognized character). However, the variable itself is over 50K characters long... Therefore I need to split the variable down before i can completely clean it (I have successfully called it onto xml but it was only 17K long). I have found macros and codes that split them into columns. But ideally I would like to break them into chunks/files and cleanse them and then merge them up again (PROC SQL UNION seems to be the way for it).

Can anyone help me on this. I would really appreciate pointers or articles that might help me out.

Thank you.

This is the code I used to just look into the length of the variable. Ideally I want to split it 20k per file and then cleanse each file. 

 

Create table tmp as

       select * from connection to sqlserver

              (Select

                 substring(cast(bigfile_xml as nvarchar(max)),17000,17000) as XMLCol

              From

                     bigfile

           Where id = 1;

              );


Accepted Solutions
Solution
‎01-30-2018 08:44 AM
Super User
Super User
Posts: 7,932

Re: How do I split my data set into multiple files?

Posted in reply to AlexMoreton

Why not just take it all at once using multiple variables?

create table tmp as
 select * from connection to sqlserver
  (select
      substring(cast(bigfile_xml as nvarchar(max)),1,17000) as XMLCol1
    , substring(cast(bigfile_xml as nvarchar(max)),17001,17000) as XMLCol2
    , substring(cast(bigfile_xml as nvarchar(max)),34001,17000) as XMLCol3
...
   from
      bigfile
  where id = 1;
   )
;

View solution in original post


All Replies
Super User
Super User
Posts: 9,402

Re: How do I split my data set into multiple files?

Posted in reply to AlexMoreton

??   Variables in SAS cannot be longer than 2000 characters.  Try starting by posting some example test data, in the form of a datastep of what you actually have.  Is it an XML file you are trying to read into SAS?  If so why not import into data direct from XML, then post process the data.  really can't follow what you trying to do.

Contributor
Posts: 30

Re: How do I split my data set into multiple files?

RW9  

The file I was given is a dataset with a string that has over 50K characters. I already created a map file to map the string out and allocate the characters to different variables. However the string is too large. I want to export the file into XML but whenever I tried exporting it I end up with only 17K characters.

Therefore I need to basically break it down into smaller chunks before I export it (unless my logic is incorrect. If so please point it out to me, thank you).

 

 

let BigfileC= CountC(Bigfile);              *CountC can be used to count characters instead of words.

let chunks = round(BigfileC/32000);   *i'm trying to round it up so I know how many chunks I need for it (each observation might have                                                                 different lengths)

Do I = 0 to chunks;                              * I'm thinking a DO LOOP would be a possible method for this splitting?

Text(I) = 

Substr(text,i*32000+1,(i+1)*32000) ;   * This is how I thought would help create the files.

Output Text(I)

End;

 

 

I know I might be wasting your time but please be a bit patient as I am just learning this language. Thank you. I really appreciate your help.

Super User
Super User
Posts: 9,402

Re: How do I split my data set into multiple files?

Posted in reply to AlexMoreton

Nope, its no bother.  Am just trying to understand your starting point here:

"The file I was given is a dataset"

Is it a SAS dataset - i.e. .sas7bdat extension, as SAS has maximum variable length for strings as 2000.   Or is it another file, maybe a text file with end of line characters?  That would makes sense as the "line" effectively is the whole thing.  Judging by your use of the 32000 it seems you are dealing with a text file, if so you could just read it in one character at a time and concatenate that on the output;

data want;
  file 'abc.txt';
  length line $2000;
  input a $char1.;
  line=cat(line,a);
  if lengthn(line) > 1999 then do;
    output;
    line="";
  end;
run;

This will write out to a datasets lines of 2000 characters.

Contributor
Posts: 30

Re: How do I split my data set into multiple files?

RW9, Thanks for the advice!

I got the file from a SQL database and I think it's originally a text file (think that's why the xml).

I'll give it a go and let you know! 

Thank you

 

Super User
Super User
Posts: 7,932

Re: How do I split my data set into multiple files?

Posted in reply to AlexMoreton

What is in the these 70K strings?  From the name it looks like it contains some type of XML.  What is in the XML? What is the format?

You might have better luck using XML parsing in your SQL server to pull out the data fields from the XML string instead of just taking substrings that might break in the middle of an XML tag.

 

How many of these xml strings do you have?  How many observations are in your source table?

You could try writing them to a text file(s) and then trying to parse the XML from the file.

Super Contributor
Posts: 276

Re: How do I split my data set into multiple files?

Posted in reply to AlexMoreton

Hi,

From reading your entire thread, it sounds like you have a Database Table with a column/variable of a CLOB data type containing the String you are trying to manipulate and cleans.

 

These two papers may give you some tips with how to handle such data.

View Microsoft Word RTF Objects Embedded In Oracle

- Using SAS® to Report Data in XML Format

 

Hope this helps,

Ahmed

Solution
‎01-30-2018 08:44 AM
Super User
Super User
Posts: 7,932

Re: How do I split my data set into multiple files?

Posted in reply to AlexMoreton

Why not just take it all at once using multiple variables?

create table tmp as
 select * from connection to sqlserver
  (select
      substring(cast(bigfile_xml as nvarchar(max)),1,17000) as XMLCol1
    , substring(cast(bigfile_xml as nvarchar(max)),17001,17000) as XMLCol2
    , substring(cast(bigfile_xml as nvarchar(max)),34001,17000) as XMLCol3
...
   from
      bigfile
  where id = 1;
   )
;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 317 views
  • 1 like
  • 4 in conversation