How to strip HTML tags from a string in SQL Server?

sql
html-stripping
user-defined-functions
xml-method
by Alex Kataev · Aug 18, 2024
TLDR

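Quickly strip HTML tags with a WHILE loop and SQL Server's built-in PATINDEX, CHARINDEX, and STUFF functions:

DECLARE @Html NVARCHAR(MAX) = '<p>Some <b>HTML</b> content.</p>';
-- Keep deleting from the first '<' up to the next '>' until no tag is left
WHILE PATINDEX('%<%>%', @Html) > 0
    SET @Html = STUFF(@Html,
                      PATINDEX('%<%>%', @Html),
                      CHARINDEX('>', @Html, PATINDEX('%<%>%', @Html)) - PATINDEX('%<%>%', @Html) + 1,
                      '');
SELECT @Html AS CleanText;

Result: "Some HTML content." The tags are stripped. Easy-peasy! However, the loop simply deletes everything from a < to the next >, so it does not decode HTML entities, it keeps the text inside <style> and <script> blocks, and it misfires on a literal < in the content. For those cases, let's level up.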
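To make the "misfires on a literal <" case concrete, here is a minimal sketch that runs the same loop over text with no tags at all. The variable name @Text and the sample string are made up for illustration.

```sql
-- Same loop as in the TLDR, fed with text that contains literal < and > characters.
DECLARE @Text NVARCHAR(MAX) = N'a < b and c > d';

WHILE PATINDEX('%<%>%', @Text) > 0
    SET @Text = STUFF(@Text,
                      PATINDEX('%<%>%', @Text),
                      CHARINDEX('>', @Text, PATINDEX('%<%>%', @Text)) - PATINDEX('%<%>%', @Text) + 1,
                      '');

SELECT @Text AS CleanText;  -- 'a  d': everything from < to > was treated as a tag
```

If your content can contain comparison operators or similar, the pattern-based loop is probably not the right tool on its own.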

Advanced strategies: XML & UDFs

XML in T-SQL for complex HTML structures

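SQL Server's XML data type can strip tags from well-formed markup and, as a bonus, decodes the entities XML knows about (&amp;, &lt;, &gt;, &quot;, &apos; and numeric character references). Cast the string to XML and read the text value of the whole document:

DECLARE @Html NVARCHAR(MAX) = '<p>Some <b>HTML</b> content &amp; entities.</p>';
SELECT TRY_CAST('<root>' + @Html + '</root>' AS XML).value('.', 'NVARCHAR(MAX)') AS CleanText;

Don't wrap the input in a CDATA section: CDATA tells the parser to treat the markup as literal text, so the tags and entities would come back untouched. The input has to be well-formed XML. An unclosed <br> or an HTML-only entity such as &nbsp; makes the cast fail, and TRY_CAST then returns NULL.

TRY_CAST requires SQL Server 2012 or later. On SQL Server 2005 through 2008 R2, use CAST instead, which raises an error on malformed input rather than returning NULL. SQL Server 2000 has neither the XML data type nor the MAX keyword, so there you have to fall back to the loop approach with NVARCHAR(4000).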
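Real-world HTML is often not well-formed XML, so the cast can come back empty-handed. The sketch below assumes the only offenders in the input are &nbsp; and unclosed <br> tags, and patches them with REPLACE before casting. The variable names and sample string are illustrative, and other malformed constructs would need their own handling.

```sql
-- Assumption: the only problems in the input are &nbsp; and unclosed <br> tags.
DECLARE @Html NVARCHAR(MAX) = N'<p>Line&nbsp;one<br>Line two</p>';

-- As-is, the cast fails (undeclared entity, unclosed tag), so TRY_CAST yields NULL.
SELECT TRY_CAST(@Html AS XML).value('.', 'NVARCHAR(MAX)') AS RawAttempt;

-- Patch the known offenders first, then cast again.
DECLARE @Fixed NVARCHAR(MAX) =
    REPLACE(REPLACE(@Html, N'&nbsp;', N' '), N'<br>', N'<br/>');

SELECT TRY_CAST(@Fixed AS XML).value('.', 'NVARCHAR(MAX)') AS CleanText;
```

This only goes so far. If the markup is arbitrarily messy, the loop or a UDF is likely the more forgiving route.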

User-Defined Functions for the tricky bits

A UDF gives you one reusable place for the stripping logic when you need to treat specific tags such as <STYLE> differently, and you can customize it for your own needs. Here's the skeleton:

CREATE FUNCTION dbo.udf_StripHTML (@HTMLText NVARCHAR(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
    DECLARE @StrippedHTMLText NVARCHAR(MAX) = @HTMLText;
    -- Define the tag boundaries
    -- Add logic to exclude STYLE tags along with their content
    -- Because you hate inline styles as much as I hate pineapple on pizza
    -- Implement a loop to strip off the tags
    RETURN @StrippedHTMLText;
END

UDFs are reusable and handle a range of scenarios. Because they return a cleaned copy, they leave your stored data intact like a well-behaved guest. Just don't forget the REPLACE function to map HTML entities back to characters.
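One way the skeleton's comments could be filled in is sketched below. It is not the only possible implementation. The function name dbo.udf_StripHTML_Sketch and the variable names are hypothetical, and matching of '<style' is case-insensitive only if the database collation is.

```sql
CREATE FUNCTION dbo.udf_StripHTML_Sketch (@HTMLText NVARCHAR(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
    DECLARE @Result NVARCHAR(MAX) = @HTMLText;
    DECLARE @Start INT, @End INT;

    -- Pass 1: remove <style ...>...</style> blocks together with their content.
    SET @Start = CHARINDEX('<style', @Result);
    WHILE @Start > 0
    BEGIN
        SET @End = CHARINDEX('</style>', @Result, @Start);
        IF @End = 0 BREAK;  -- no closing tag: leave the rest to pass 2
        SET @Result = STUFF(@Result, @Start, @End - @Start + LEN('</style>'), '');
        SET @Start = CHARINDEX('<style', @Result);
    END;

    -- Pass 2: remove every remaining tag, from '<' to the next '>'.
    SET @Start = PATINDEX('%<%>%', @Result);
    WHILE @Start > 0
    BEGIN
        SET @End = CHARINDEX('>', @Result, @Start);
        SET @Result = STUFF(@Result, @Start, @End - @Start + 1, '');
        SET @Start = PATINDEX('%<%>%', @Result);
    END;

    RETURN @Result;
END
GO  -- batch separator used by SSMS and sqlcmd

SELECT dbo.udf_StripHTML_Sketch(
    N'<style>p{color:red}</style><p>Some <b>HTML</b> content.</p>') AS CleanText;
-- Some HTML content.
```

The same two-pass idea extends to <script> blocks by repeating pass 1 with a different pair of markers.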

Performance in mind

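Whether you use the XML method or a UDF, test with representative sample data before you roll it out. A scalar UDF built around a WHILE loop runs once per row, so it can crawl on large tables. Performance is key, and no one likes a slow show-off. SQL Server is not your grandma: you have to make it run faster!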
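A rough way to run such a test is sketched here, assuming a throwaway temp table filled with copies of one snippet. The table name #Samples, the row count, and the sample HTML are all made up, and real data will behave differently.

```sql
-- Hypothetical sample set: 10,000 copies of one small HTML snippet.
CREATE TABLE #Samples (Id INT IDENTITY(1,1) PRIMARY KEY, Body NVARCHAR(MAX));

INSERT INTO #Samples (Body)
SELECT TOP (10000) N'<p>Some <b>HTML</b> content &amp; entities.</p>'
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;

SET STATISTICS TIME ON;   -- CPU and elapsed times show up in the Messages output

-- Summing the lengths forces the stripping expression to run for every row.
SELECT SUM(LEN(TRY_CAST(Body AS XML).value('.', 'NVARCHAR(MAX)'))) AS TotalChars
FROM #Samples;

SET STATISTICS TIME OFF;
DROP TABLE #Samples;
```

To compare approaches, swap the XML expression for a call to your own stripping function and run the timed query again on the same rows.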

Special character considerations

Accented characters: a piece of cake

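NVARCHAR stores accented characters just fine, so stripping tags won't damage them. If you also need to fold accents down to plain ASCII letters, REPLACE does it one character at a time:

SELECT REPLACE(@Html COLLATE Latin1_General_BIN, N'é', N'e') AS NormalizedText;

The COLLATE clause is your handy tool here. A binary collation makes the comparison exact, so only the lowercase é is replaced and É or a plain e is left alone. The N prefix keeps the literals Unicode.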
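When several accented characters need folding, the single REPLACE can be chained. This is a minimal sketch with a made-up sample string that covers only three characters, so a real mapping would need one REPLACE per character you care about.

```sql
DECLARE @Text NVARCHAR(MAX) = N'Café crème brûlée';

-- One REPLACE per accented character; the binary collation keeps each match exact.
SELECT REPLACE(
           REPLACE(
               REPLACE(@Text COLLATE Latin1_General_BIN, N'é', N'e'),
               N'è', N'e'),
           N'û', N'u') AS NormalizedText;
-- Cafe creme brulee
```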

HTML entities to the rescue

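Anybody dealing with HTML tags sooner or later runs into HTML entities that need converting back to characters. Here's one way to tackle it:

SET @Html = REPLACE(@Html, '&lt;', '<');
SET @Html = REPLACE(@Html, '&gt;', '>');
-- Keep replacing more entities, like the superhero you are!
SET @Html = REPLACE(@Html, '&amp;', '&'); -- always last, so '&amp;lt;' ends up as '&lt;' and not '<'

This conversion restores the characters the author meant. Run it after the tags have been stripped, so that a freshly decoded < isn't mistaken for the start of a tag.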
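One detail worth checking in any chain of entity replacements is where &amp; sits. The sketch below uses a made-up sample string to compare decoding &amp; first against decoding it last.

```sql
DECLARE @Text NVARCHAR(MAX) = N'Write &amp;lt; to show &lt;';

-- &amp; first: '&amp;lt;' becomes '&lt;', which the next REPLACE decodes a second time.
SELECT REPLACE(REPLACE(@Text, N'&amp;', N'&'), N'&lt;', N'<') AS AmpFirst;
-- Write < to show <

-- &amp; last: every entity is decoded exactly once.
SELECT REPLACE(REPLACE(@Text, N'&lt;', N'<'), N'&amp;', N'&') AS AmpLast;
-- Write &lt; to show <
```

The second result is what a browser would display for the original text, which is why decoding &amp; at the very end is usually the safer order.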

UDF-free HTML tag stripping

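If you're not a fan of UDFs, here's a TRY_CAST method with XML for HTML tag stripping. The instructions are pretty simple:

SELECT TRY_CAST(@Html AS XML).value('.', 'NVARCHAR(MAX)') AS PlainText;

The '.' path returns the text of the whole document. A path like '(/text())[1]' would only pick up the first text node at the top level, and that comes back NULL when all the text sits inside tags. This gives you a clean, tag-free result without the help of additional functions, as long as the HTML is well-formed; otherwise you get NULL. Because who doesn't like minimalism, right?!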
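Applied to a whole column, the same expression also tells you which rows it could not handle. The table variable @Pages and its contents below are invented for illustration.

```sql
-- Hypothetical data: one well-formed row and one with an unclosed <br>.
DECLARE @Pages TABLE (Id INT PRIMARY KEY, Body NVARCHAR(MAX));

INSERT INTO @Pages (Id, Body)
VALUES (1, N'<p>Some <b>HTML</b> content.</p>'),
       (2, N'<p>Unclosed <br> tag</p>');

SELECT Id,
       TRY_CAST(Body AS XML).value('.', 'NVARCHAR(MAX)') AS PlainText,
       -- 1 marks rows whose markup is not well-formed XML and needs another method
       CASE WHEN TRY_CAST(Body AS XML) IS NULL THEN 1 ELSE 0 END AS NeedsFallback
FROM @Pages;
-- Id 1: 'Some HTML content.', NeedsFallback 0
-- Id 2: NULL,                 NeedsFallback 1
```

Rows flagged this way could then be routed through the loop or a UDF instead.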