Solved: Create new variable by considering semicolon or spcae and split the da...

pdhokriya · Posted 10-12-2020 02:50 PM

Dear Reader,

I am not getting a correct output.

My requirement is col1 = hema; and col2 = ashu pallu and so on.

Space and latest semicolon should be consider while splitting the data.

**********************************************************

data abc;
old = "hema;ashu pallu;hema;ashu pallu;hema;
ashu pallu;hema;";output;
old = "usa;swati;sajju;priyanka;priyanka";output;
old = "hema;m?;ashu pallu;hema;ashu pallu;hema;
ashu palluhema;";output;
old = "My prog";output;
run;

%let w = 10 ;

data want (drop = _:) ;
set abc ;
array c $ &w col col1-col9 ;
/*if old^="" and length(strip(old)) le 10 then col=old; */
_i_ = 1 ;
do _x = 1 to countw (old, "; ") ;
c = catx ("; ", c, scan (old, _x, "; ")) ;
if length (c) <= &w and length (c) + length (scan (old, _x + 1)) > &w or _x = countw (old, "; ") then _i_ + 1 ;
end ;
run ;
************************************************************

Thank you in advance for your help. Take care.

Regards

Priyanka

Tom · Posted 10-12-2020 05:37 PM

Perhaps I am not understanding the issue but don't you just want to use SCAN()? If you want to know how many items there are use COUNTW().

First let's make some test data:

data have;
  input old $80.;
cards4;
hema;ashu pallu;hema;ashu pallu;hema;ashu pallu;hema
usa;swati;sajju;priyanka;priyanka
hema;m?;ashu pallu;hema;ashu pallu;hema;ashu palluhema
My prog
;;;;

Now let's run a program to create a lot of new variables to hold the individual words (tokens). The data step needs to know how many we are going to create. And how long to make each one. We can make it more flexible by setting those numbers in macro variables.

%let max=9;
%let width=10;

Now just read in the existing data and use an ARRAY and a DO loop.

data want ;
  set have ;
  array new [&max] $&width;
  do i=1 to countw(old,';');
    new[i] = scan(old,i,';');
  end;
  drop i;
run;

Obs   new1      new2         new3         new4         new5         new6         new7         new8   new9

 1    hema      ashu pallu   hema         ashu pallu   hema         ashu pallu   hema
 2    usa       swati        sajju        priyanka     priyanka
 3    hema      m?           ashu pallu   hema         ashu pallu   hema         ashu pallu
 4    My prog

If you want to treat space as a delimiter also then include it in the list of delimiters. You will need more than 9 variables.

%let width=10;
%let max=12;

data want ;
  set have ;
  array new [&max] $&width;
  do i=1 to countw(old,'; ');
    new[i] = scan(old,i,'; ');
  end;
  drop i;
run;

Obs   new1   new2    new3    new4       new5       new6    new7    new8   new9    new10       new11   new12

 1    hema   ashu    pallu   hema       ashu       pallu   hema    ashu   pallu   hema
 2    usa    swati   sajju   priyanka   priyanka
 3    hema   m?      ashu    pallu      hema       ashu    pallu   hema   ashu    palluhema
 4    My     prog

View solution in original post

jimbarbour · Posted 10-12-2020 04:48 PM

@pdhokriya

Priyanka, I think you may have better results with COUNTC just looking for the semi-colon than looking for the combination of a semi-colon and a blank. I think the end of line will be a problem otherwise.

Here's how I would code the solution. I will place the results with the Old string broken into string tokens below the code.

Regards,

Jim

data abc;
	DROP	_:;
	RETAIN	_Max_Tokens	0;

	old 			= "hema;ashu pallu;hema;ashu pallu;hema;ashu pallu;hema;";
	_Token_Cnt		=	COUNTC(Old, ';');
	IF	_Token_Cnt	>	_Max_Tokens	THEN
		_Max_Tokens	=	_Token_Cnt;
	output;

	old 			= "usa;swati;sajju;priyanka;priyanka";
	_Token_Cnt		=	COUNTC(Old, ';');
	IF	_Token_Cnt	>	_Max_Tokens	THEN
		_Max_Tokens	=	_Token_Cnt;
	output;

	old 			= "hema;m?;ashu pallu;hema;ashu pallu;hema;ashu palluhema;";
	_Token_Cnt		=	COUNTC(Old, ';');
	IF	_Token_Cnt	>	_Max_Tokens	THEN
		_Max_Tokens	=	_Token_Cnt;
	output;

	old 			= "My prog";
	_Token_Cnt		=	COUNTC(Old, ';');
	IF	_Token_Cnt	>	_Max_Tokens	THEN
		_Max_Tokens	=	_Token_Cnt;
	output;

	CALL	SYMPUTX('Max_Tokens', _Max_Tokens, 'G');
RUN;

%PUT	NOTE:  &=Max_Tokens;

DATA	Parsed_Data;
	DROP	_:;
	SET	abc;
	ARRAY	Tokens	[*]	$32	Token1	-	Token&Max_Tokens;

	_i				=	0;
	DO	_i			=	1	TO	&Max_Tokens;
		Tokens[_i]	=	SCAN(Old, _i, ';');
	END;
RUN;

Results:

pdhokriya · Posted 10-12-2020 10:44 PM

I can see obs 1 is fine but not the 2nd and 3rd obs.

Here usa;swati; is token1 as length is 10, if space in between 2data then should be split.

jimbarbour · Posted 10-13-2020 02:54 AM

@pdhokriya,

Here are the results that I have from a program that I have written:

I believe these results are correct based on a $10 restriction. I inserted an extra record (record 4) as follows:

Old = "hema;m?;ashu PROG;hema;ashu pallu;hema;ashu palluhema;";

I inserted this record because I wanted a 5 byte token ("ashu ") and another 5 byte token ("PROG;") so that I could test that the space would be preserved in the output. It was preserved.

Code is below.

Jim

DATA	abc;
	DROP	_:;
	LENGTH	Old	$256.;
	RETAIN	_Max_Tokens	0;

	Old = 	"hema;ashu pallu;hema;ashu pallu;hema;ashu pallu;hema;";
	LINK	Count_Tokens;

	Old = 	"usa;swati;sajju;priyanka;priyanka";
	LINK	Count_Tokens;

	Old = 	"hema;m?;ashu pallu;hema;ashu pallu;hema;ashu palluhema;";
	LINK	Count_Tokens;

	Old = 	"hema;m?;ashu PROG;hema;ashu pallu;hema;ashu palluhema;";
	LINK	Count_Tokens;

	Old = 	"My prog";
	LINK	Count_Tokens;

	CALL	SYMPUTX('Max_Tokens', _Max_Tokens, 'G');

	******;
	RETURN;
	******;

	************;
	Count_Tokens:
	************;
		IF	SUBSTR(Old, LENGTHN(Old), 1)		^=	';'	THEN
			Old									=	CATS(Old, ';');
		_Token_Cnt								=	COUNTC(TRANSTRN(STRIP(Old), ' ', ';'), ';');
		IF	_Token_Cnt							>	_Max_Tokens	THEN
			_Max_Tokens							=	_Token_Cnt;
		OLD										=	TRANSTRN(STRIP(Old), ' ', '~');
		OUTPUT;
	******;
	RETURN;
	******;
RUN;

%PUT	NOTE:  &=Max_Tokens;

DATA	Parsed_Data;
	DROP	_:;
	SET	abc;
	ARRAY	Tokens	[*]	$10		New1	-	New&Max_Tokens;

	_Token_Length								=	10;
	_Token_Nbr									=	1;
	_Total_Length								=	LENGTHN(Old);
	_Current_Position							=	1;
	_Current_Length								=	0;
	_End_of_Record								=	0;
	_Prev_Delimiter								=	0;
	_Next_Delimiter								=	0;

	LINK	Get_Next_Delimiter;

	DO	WHILE	(_Current_Position				<	_Total_Length);
		Tokens[_Token_Nbr]						=	SUBSTR(Old, _Current_Position, (_Next_Delimiter - _Prev_Delimiter));

		_Current_Length							=	LENGTH(Tokens[_Token_Nbr]);

		_Current_Position						=	_Current_Position + (_Next_Delimiter - _Prev_Delimiter);
		_Prev_Delimiter							=	_Next_Delimiter;
		LINK	Get_Next_Delimiter;

		IF	NOT	_End_Of_Record															AND
			_Next_Delimiter - _Prev_Delimiter	<=	_Token_Length	-	_Current_Length	THEN
			DO;
				Tokens[_Token_Nbr]				=	CAT(STRIP(Tokens[_Token_Nbr]), SUBSTR(Old, _Current_Position, (_Next_Delimiter - _Prev_Delimiter)));
				_Current_Position				=	(_Current_Position + _Next_Delimiter - _Prev_Delimiter);
				_Prev_Delimiter					=	_Next_Delimiter;
				LINK	Get_Next_Delimiter;
			END;

		_Token_Nbr								+	1;
	END;

	DO	_j										=	1	TO	DIM(Tokens);
		Tokens[_j]								=	TRANSTRN(Tokens[_j], '~', ' ');
	END;

	******;
	RETURN;
	******;

	******************;
	Get_Next_Delimiter:
	******************;
		IF	_Current_Position					<=	_Total_Length		THEN
			_Next_Delimiter						=	0;
		ELSE
			_End_of_Record						=	1;

		DO	_i	=	_Current_Position			TO	_Total_Length;
			_Current_Byte						=	SUBSTR(Old, _i, 1);
			IF	SUBSTR(Old, _i, 1)				=	'~'					THEN
				DO;
					_Next_Delimiter				=	_i;
					_i							=	_Total_Length;
				END;
			ELSE
			IF	SUBSTR(Old, _i, 1)				=	';'					THEN
				DO;
					_Next_Delimiter				=	_i;
					_i							=	_Total_Length;
				END;
		END;

	******;
	RETURN;
	******;
RUN;

Tom · Posted 10-12-2020 05:37 PM

Perhaps I am not understanding the issue but don't you just want to use SCAN()? If you want to know how many items there are use COUNTW().

First let's make some test data:

data have;
  input old $80.;
cards4;
hema;ashu pallu;hema;ashu pallu;hema;ashu pallu;hema
usa;swati;sajju;priyanka;priyanka
hema;m?;ashu pallu;hema;ashu pallu;hema;ashu palluhema
My prog
;;;;

Now let's run a program to create a lot of new variables to hold the individual words (tokens). The data step needs to know how many we are going to create. And how long to make each one. We can make it more flexible by setting those numbers in macro variables.

%let max=9;
%let width=10;

Now just read in the existing data and use an ARRAY and a DO loop.

data want ;
  set have ;
  array new [&max] $&width;
  do i=1 to countw(old,';');
    new[i] = scan(old,i,';');
  end;
  drop i;
run;

Obs   new1      new2         new3         new4         new5         new6         new7         new8   new9

 1    hema      ashu pallu   hema         ashu pallu   hema         ashu pallu   hema
 2    usa       swati        sajju        priyanka     priyanka
 3    hema      m?           ashu pallu   hema         ashu pallu   hema         ashu pallu
 4    My prog

If you want to treat space as a delimiter also then include it in the list of delimiters. You will need more than 9 variables.

%let width=10;
%let max=12;

data want ;
  set have ;
  array new [&max] $&width;
  do i=1 to countw(old,'; ');
    new[i] = scan(old,i,'; ');
  end;
  drop i;
run;

Obs   new1   new2    new3    new4       new5       new6    new7    new8   new9    new10       new11   new12

 1    hema   ashu    pallu   hema       ashu       pallu   hema    ashu   pallu   hema
 2    usa    swati   sajju   priyanka   priyanka
 3    hema   m?      ashu    pallu      hema       ashu    pallu   hema   ashu    palluhema
 4    My     prog

pdhokriya · Posted 10-12-2020 10:08 PM

Thank you for the solution, But my requirement is 10 length with or without space and colon. e.g. at 8th and 9th position data is present after semicolon or space then it should go in next colon (in case length = 10)

obs1 should be new1 = hema; and new2=ashu pallu
obs2 should be new1= usa;swati; and new2=sujju;
obs3 should be new1=hema;m?; and new2=ashu pallu

Tom · Posted 10-13-2020 08:05 AM

@pdhokriya wrote:
Thank you for the solution, But my requirement is 10 length with or without space and colon. e.g. at 8th and 9th position data is present after semicolon or space then it should go in next colon (in case length = 10)

obs1 should be new1 = hema; and new2=ashu pallu
obs2 should be new1= usa;swati; and new2=sujju;
obs3 should be new1=hema;m?; and new2=ashu pallu

If you want to break the string in 10 character pieces just use SUBSTR().

new[i] = substr(old,1+(i-1)*10);

If don't want that then explain the rules.

Tom · Posted 10-13-2020 08:56 AM

If you use FINDC() you can search backwards from the end of the string to find last ';'.

Looks like you also want to remove semi-colons that follow the 10 character substring.

data tall ;
  set have ;
  length new $%eval(&width+1) ;
  copy=old;
  do col=1 by 1 until(copy=' ');
    new=copy;
    loc = findc(new,';',-10);
    if loc=0 then loc=&width;
    copy = substrn(copy,loc+1+(substrn(copy,loc+1,1)=';'));
    new = substrn(new,1,loc);
    if col=1 or new ne ' ' then output;
  end;
run;

proc transpose data=tall out=want (drop=_name_);
  by row old;
  var new ;
run;

Results:

Obs    row    old                                                        COL1

 1      1     hema;ashu pallu;hema;ashu pallu;hema;ashu pallu;hema;      hema;
 2      2     usa;swati;sajju;priyanka;priyanka                          usa;swati;
 3      3     hema;m?;ashu pallu;hema;ashu pallu;hema;ashu palluhema;    hema;m?;
 4      4     My prog                                                    My prog

Obs    COL2          COL3            COL4       COL5        COL6       COL7

 1     ashu pallu    hema;        ashu pallu    hema;    ashu pallu    hema;
 2     sajju;        priyanka;    priyanka
 3     ashu pallu    hema;        ashu pallu    hema;    ashu pallu    hema;
 4

Create new variable by considering semicolon or spcae and split the data

Re: Create new variable by considering semicolon or spcae and split the data

Re: Create new variable by considering semicolon or spcae and split the data

Re: Create new variable by considering semicolon or spcae and split the data

Re: Create new variable by considering semicolon or spcae and split the data

Re: Create new variable by considering semicolon or spcae and split the data

Re: Create new variable by considering semicolon or spcae and split the data

Re: Create new variable by considering semicolon or spcae and split the data

Re: Create new variable by considering semicolon or spcae and split the data

Registration is open

SAS Training: Just a Click Away