
Split String By Html Entities?

My string contains a lot of HTML entities, like this: "&#x22;Hello&nbsp;&lt;everybody&gt;&nbsp;there&#x22;". And I want to split it by HTML entities into thi

Solution 1:

It looks like you can just split on the regex &[^;]*;. That is, each delimiter is a string that starts with &, ends with ;, and contains anything but ; in between.

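If you can have multiple delimiters in a row, and you don't want the empty strings between them, just use (?:&[^;]*;)+ (or, in general, a (?:delim)+ pattern). The group is non-capturing because JavaScript's split() inserts the contents of capturing groups into the result array.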

If you can have delimiters at the beginning or end of the string, and you don't want the empty strings they cause, then just trim them away before you split.


Example

Here's a snippet to demonstrate the above ideas (see also on ideone.com):

var s = "&#x22;Hello&nbsp;&lt;everybody&gt;&nbsp;there&#x22;"

print (s.split(/&[^;]*;/));
// ,Hello,,everybody,,there,

print (s.split(/(?:&[^;]*;)+/));
// ,Hello,everybody,there,

print (
   s.replace(/^(?:&[^;]*;)+/, "")
    .replace(/(?:&[^;]*;)+$/, "")
    .split(/(?:&[^;]*;)+/)
);
// Hello,everybody,there

Solution 2:

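var a = str.split(/&[#a-z0-9]+;/i); should do it (the i flag lets it match entities that contain uppercase letters, such as &Eacute;; the & and ; need no backslash escapes), although you'll end up with empty slots in the array when two entities are next to each other or when an entity sits at the very start or end of the string.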
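
One way to get rid of those empty slots is to filter the array after splitting. The following is only a minimal sketch, assuming empty strings are never wanted in the result; the variable names are illustrative, the sample string is borrowed from Solution 1, and the i flag on the regex is an addition for matching uppercase entity names.

```javascript
// Sample input borrowed from the example in Solution 1.
const str = "&#x22;Hello&nbsp;&lt;everybody&gt;&nbsp;there&#x22;";

// Split on every single entity; adjacent, leading, and trailing
// entities each leave an empty string behind.
const raw = str.split(/&[#a-z0-9]+;/i);

// Keep only the non-empty pieces.
const parts = raw.filter((piece) => piece.length > 0);

console.log(JSON.stringify(raw));   // ["","Hello","","everybody","","there",""]
console.log(JSON.stringify(parts)); // ["Hello","everybody","there"]
```

Filtering removes every empty string, so it is not suitable if an empty field between two delimiters carries meaning for you.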

Solution 3:

split(/&.*?;(?=[^&]|$)/)

Then drop the first and last elements of the result, which are empty strings whenever the input starts and ends with an entity:

["", "Hello", "everybody", "there", ""]
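
Cutting the first and last elements unconditionally would lose real text when the string does not start or end with an entity. A hedged sketch of a guard follows; trimEmptyEnds is a hypothetical helper name introduced here for illustration, not something from the answer.

```javascript
// Hypothetical helper: removes a leading and/or trailing empty string,
// but leaves the array alone when the ends hold real text.
function trimEmptyEnds(pieces) {
  const start = pieces[0] === "" ? 1 : 0;
  const end =
    pieces[pieces.length - 1] === "" ? pieces.length - 1 : pieces.length;
  return pieces.slice(start, end);
}

// The pattern from this answer: the lookahead makes a run of adjacent
// entities match as one delimiter.
const delimiter = /&.*?;(?=[^&]|$)/;

const wrapped = "&#x22;Hello&nbsp;&lt;everybody&gt;&nbsp;there&#x22;";
const bare = "Hello&nbsp;there";

console.log(JSON.stringify(trimEmptyEnds(wrapped.split(delimiter))));
// ["Hello","everybody","there"]
console.log(JSON.stringify(trimEmptyEnds(bare.split(delimiter))));
// ["Hello","there"]
```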

Solution 4:

>>"&#x22;Hello&nbsp;&lt;everybody&gt;&nbsp;there&#x22;".split(/(?:&[^;]+;)+/)
['', 'Hello', 'everybody', 'there', '']

The regex is: /(?:&[^;]+;)+/

It matches an entity as & followed by one or more non-; characters, followed by a ;, and then treats a run of one or more such entities as a single split delimiter. The non-capturing (?:expression) syntax is used so that the matched delimiters are not put into the result array (split() inserts the contents of capturing groups into the result array if they appear in the pattern).
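
To see the difference the non-capturing group makes, here is a small comparison sketch. The input string and variable names are made up for illustration. Note that with a repeated capturing group, only the last entity of each run ends up in the array.

```javascript
// Illustrative input: one run of two entities, then a single entity.
const text = "Hello&nbsp;&lt;everybody&gt;there";

// Capturing group: split() splices the group's last match into the result.
const withCapture = text.split(/(&[^;]+;)+/);

// Non-capturing group: only the text between the delimiters is returned.
const withoutCapture = text.split(/(?:&[^;]+;)+/);

console.log(JSON.stringify(withCapture));
// ["Hello","&lt;","everybody","&gt;","there"]
console.log(JSON.stringify(withoutCapture));
// ["Hello","everybody","there"]
```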
