ToolPal

HTML to Text: Strip Tags and Get Clean Plain Text Fast

Learn how to strip HTML tags and extract clean plain text, decode entities, preserve line breaks, and when to use HTML-to-text vs HTML-to-Markdown.

April 12, 2026 · 11 min read

There's a specific kind of frustration that comes from copying text off a web page and having it paste as a giant wall of <p> tags and &nbsp; characters. Or opening an HTML email template in a text editor to grab a quote, and instead finding a soup of <td> and <span style="color:#333333"> around every word.

That's the problem an HTML to text converter solves. It sounds simple: strip the tags, show the text. But there are enough small gotchas (entities, line breaks, whitespace normalization) that doing it right takes more than a regex replace. Let's talk through what actually happens under the hood, when you need this, and where the tool honestly falls short.

What an HTML to Text Converter Actually Does

At a surface level, converting HTML to text means removing anything inside < > brackets. But that's only the start.

Tag removal is the obvious part. Every element (<div>, <p>, <span>, <a>, <img>, the whole lot) gets stripped out. What's left is whatever was between the opening and closing tags.

Entity decoding is where a lot of naive implementations fall down. HTML uses entities to represent characters that would otherwise break the markup. The ampersand & becomes &amp;. The less-than sign < becomes &lt;. A non-breaking space becomes &nbsp;. Curly quotes often appear as &#8220; and &#8221;. If your converter only strips tags and doesn't also decode entities, your "plain text" is going to be full of these codes sitting naked in the output. Not exactly readable.

A proper converter decodes the full set of named entities (&amp;, &lt;, &gt;, &quot;, &apos;, &nbsp; and the hundreds of others like &mdash;, &copy;, &reg;) plus numeric entities in both decimal (&#160;) and hex (&#xA0;) form.
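As a rough illustration, Python's standard library decodes all three forms through html.unescape. This is a minimal sketch, not how the tool itself is implemented; the decode_entities wrapper and the sample strings are made up for the example.

```python
import html


def decode_entities(text: str) -> str:
    """Decode named, decimal, and hex character references."""
    return html.unescape(text)


# Hypothetical sample inputs paired with the text they should decode to.
samples = {
    "Fish &amp; chips": "Fish & chips",
    "5 &lt; 10": "5 < 10",
    "&#8220;quoted&#8221;": "\u201cquoted\u201d",  # decimal references
    "caf&#xE9;": "caf\u00e9",                      # hex reference
    "a&nbsp;b": "a\u00a0b",  # stays a non-breaking space, not U+0020
}

for encoded, expected in samples.items():
    assert decode_entities(encoded) == expected
```

Note that &nbsp; decodes to U+00A0 rather than an ordinary space, so a converter that wants regular spaces has to normalize that character separately.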

Line break preservation is the third piece. HTML's whitespace rules are different from plain text's: multiple spaces collapse to one, and newlines in source code mean nothing visually. The structure you see in a browser comes from block-level elements (<p>, <div>, <h1> through <h6>, <li>, <blockquote>, <hr>, and so on) plus explicit <br> line breaks. A thoughtful converter inserts newlines or blank lines when it encounters these elements, so the output has some sense of paragraph structure rather than everything running together in one long line.

Go to /tools/html-to-text, paste in your HTML, and the tool handles all three of these automatically.

Real-World Use Cases

Let me go through the situations where I actually reach for this kind of tool, because they're more varied than you might expect.

Cleaning Up Email HTML

Modern marketing email templates are notoriously verbose HTML. When a colleague forwards you an email asking you to quote something from it, and the only version you have is the raw .eml file or a messy paste from Outlook, you end up with content like:

<td class="mcnTextContent" style="padding-top:0;padding-right:18px;padding-bottom:9px;padding-left:18px;">
  <p>We're excited to announce our Q2 product roadmap&hellip;</p>
</td>

Paste that into the converter, and you get:

We're excited to announce our Q2 product roadmap…

That's the only part you actually wanted.

Preparing Plain Text Email Versions

If you send HTML email campaigns, you should also send a plain text fallback: it improves deliverability and serves recipients whose clients or preferences don't render HTML. Writing the plain text version by hand from a designed template is tedious. Dropping the HTML through a converter gives you a workable starting draft that you then clean up, adding actual line breaks, removing redundant navigation links, and so on.

It won't be perfect (navigation menus and image-heavy headers often produce ugly output), but it's a much better starting point than a blank file.
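Once you have the cleaned-up text, it travels alongside the HTML as a multipart/alternative message. Here is a minimal sketch using Python's standard email package; the build_campaign_email helper, the addresses, and the bodies are placeholders for illustration, and actually sending the message is left out.

```python
from email.message import EmailMessage


def build_campaign_email(subject, sender, recipient, html_body, text_body):
    """Build a message carrying both a plain text and an HTML version."""
    message = EmailMessage()
    message["Subject"] = subject
    message["From"] = sender
    message["To"] = recipient
    # The plain text part is set first so it becomes the fallback.
    message.set_content(text_body)
    # Adding an alternative converts the message to multipart/alternative.
    message.add_alternative(html_body, subtype="html")
    return message


# Hypothetical usage: text_body would be the hand-edited converter output.
msg = build_campaign_email(
    "Q2 roadmap",
    "news@example.com",
    "reader@example.com",
    "<p>We're excited to announce our Q2 product roadmap&hellip;</p>",
    "We're excited to announce our Q2 product roadmap...",
)
print(msg.get_content_type())
```

Mail clients that can't or won't render HTML show the text/plain part, which is why it is worth editing the converter's draft rather than sending it untouched.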

Web Scraping and Content Extraction

When you scrape a web page, you typically get back the full HTML of the page, which includes navigation, sidebars, ads, footers, and scripts alongside the actual content. After targeting the main content container (using a library like BeautifulSoup or cheerio), you often still end up with HTML. Running that through a converter gives you the readable text.

For quick one-off scraping tasks (grabbing a product description, pulling out the text of a blog post, extracting a recipe), pasting the HTML into the tool is genuinely faster than writing a parser. For anything systematic or at scale, you'll want server-side tooling, but for a quick task this is hard to beat.

Database or Search Index Population

If you store content as HTML (say, a CMS-backed blog or a rich text editor field) but need plain text for full-text search indexing, AI model input, or display in a context that doesn't render HTML, you need clean extraction. Converting to plain text gives you a version that's safe to index, compare, or feed to downstream systems without risk of exposing tag syntax.

Pasting Into Plain-Text Contexts

Google Docs, Notion, plain text editors, SMS templates, terminal output β€” there are plenty of places that accept text but not HTML. Copying text from a browser often carries hidden HTML formatting along with it. Converting first and pasting from the plain text output sidesteps that issue.

What HTML Entities Are and Why They Matter

This deserves a bit more explanation because it trips people up.

The HTML specification defines a set of named character references (strings like &amp;, &lt;, &gt;, &quot;) that represent special characters. The reason they exist is that HTML is itself a text format that uses <, >, and & as syntax characters. If you want to display a literal < in a browser, you have to write &lt; in your HTML, otherwise the browser thinks you're starting a new tag.

When you strip the HTML tags but don't decode the entities, this is what you get:

Input HTML:
<p>The price is $10 &amp; shipping is free. See the &lt;details&gt; page.</p>

After tag stripping only:
The price is $10 &amp; shipping is free. See the &lt;details&gt; page.

After tag stripping + entity decoding:
The price is $10 & shipping is free. See the <details> page.

The last version, with tags stripped and entities decoded, is what actual readers would see in a browser. That's what you want.
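The order of the two steps matters as well. The sketch below uses a deliberately naive regex for tag removal, which is fine for this one-line snippet but not for arbitrary HTML; the function names are made up for the example. Decoding first turns &lt;details&gt; into something that looks like a real tag, and the stripping step then deletes it.

```python
import html
import re

# Naive tag matcher, for illustration only; a real converter should use a parser.
TAG_RE = re.compile(r"<[^>]+>")


def strip_then_decode(markup: str) -> str:
    """Remove tags first, then decode entities."""
    return html.unescape(TAG_RE.sub("", markup))


def decode_then_strip(markup: str) -> str:
    """Wrong order: the decoded <details> is mistaken for a tag and removed."""
    return TAG_RE.sub("", html.unescape(markup))


source = "<p>The price is $10 &amp; shipping is free. See the &lt;details&gt; page.</p>"
print(strip_then_decode(source))
# The price is $10 & shipping is free. See the <details> page.
print(decode_then_strip(source))
# The price is $10 & shipping is free. See the  page.
```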

Some common entities you'll encounter:

  • &amp; → &
  • &lt; → <
  • &gt; → >
  • &nbsp; → non-breaking space (often appears as a regular space in output)
  • &mdash; → — (em dash)
  • &hellip; → … (ellipsis)
  • &copy; → ©
  • &#8220; and &#8221; → “ and ” (curly quotes)

A good converter handles all of these transparently.

HTML to Text vs HTML to Markdown

Both convert HTML to a different format, but they serve different purposes.

Use HTML to text when:

  • You need truly plain, unformatted content
  • You're feeding the output to a system that doesn't understand Markdown
  • You're creating plain text email versions
  • You're doing text analysis or natural language processing where markup is just noise
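  • You need a tag-free version of user-generated HTML for storage or comparison (escape it again before inserting it back into a page, since decoded text can contain < and >)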

Use HTML to Markdown when:

  • You want to preserve document structure (headings, bold, italic, links, lists)
  • You're migrating content from an HTML CMS to a Markdown-based system
  • You plan to re-render the content somewhere that supports Markdown
  • You want something a human can still edit comfortably

The key difference: HTML to text loses all formatting structure. HTML to Markdown keeps the structure in a human-readable, editable form. If you're moving a blog from WordPress to Astro or Hugo, Markdown conversion is what you want. If you're extracting text for a search engine or a language model, plain text is probably cleaner.

Line Break Behavior and What to Expect

One of the trickiest parts of HTML-to-text conversion is whitespace handling.

In HTML, source code newlines are treated as spaces, and multiple spaces collapse to one. The visual line breaks you see in a browser come entirely from block elements and CSS. When you strip the tags, you need to decide what to do with each element boundary.

A reasonable set of rules:

  • <br> → single newline
  • <p>, <div>, <h1>–<h6> → double newline (blank line before and after)
  • <li> → newline with a bullet or number prefix
  • <hr> → a line of dashes or just a blank line
  • Inline elements (<span>, <a>, <strong>, <em>) → just their text content, no extra spacing

The HTML to Text tool applies rules like these so that the output is readable and reasonably structured, not a single massive line. The exact behavior for complex layouts (tables, multi-column divs) can be imperfect; we'll talk about that in the limitations section.
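To make those rules concrete, here is a minimal sketch built on Python's standard html.parser module. It is an illustration under stated assumptions, not the tool's actual implementation: the TextExtractor class, the BLOCK_TAGS set, the "- " bullet prefix, and the ten-dash rule for <hr> are all choices made for this example, and it does not skip <script> or <style> content.

```python
import re
from html.parser import HTMLParser

# Elements that get a blank line before and after (an assumed, partial list).
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
              "blockquote", "ul", "ol"}


class TextExtractor(HTMLParser):
    def __init__(self):
        # convert_charrefs=True decodes entities before handle_data sees them.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")
        elif tag == "hr":
            self.parts.append("\n\n----------\n\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Source newlines and runs of whitespace collapse to one space.
        # \s also matches U+00A0, so &nbsp; ends up as a regular space here.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r" *\n *", "\n", text)    # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # allow at most one blank line
    return text.strip()


sample = ("<h1>Title</h1>\n<p>First   line<br>second line</p>\n"
          "<ul><li>One</li><li>Two &amp; three</li></ul>")
print(html_to_text(sample))
```

Inline elements need no special handling in this design: their tags trigger no output, so only their text content reaches the result.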

Limitations: Where This Tool Falls Short

I want to be upfront about what HTML-to-text conversion can't do well.

Images disappear entirely. There's no plain text equivalent for an image. The <img> tag gets stripped along with its src, alt, and everything else. If an image was carrying important information β€” a chart, a diagram, a logo β€” that information is gone. The alt text is accessible in the HTML but most basic converters don't surface it. If you need alt text preserved, consider converting to Markdown instead (where images become ![alt text](url)).
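If you would rather keep the alt text in a plain text output, a parser-based converter can surface it. The sketch below, again using Python's html.parser, replaces each image with a bracketed placeholder; the AltTextExtractor class and the "[image: ...]" format are assumptions for this example, not something the tool is known to do.

```python
from html.parser import HTMLParser


class AltTextExtractor(HTMLParser):
    """Collects text and replaces each <img> with its alt text in brackets."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:  # skip missing or empty (decorative) alt text
                self.parts.append(f"[image: {alt}]")

    def handle_data(self, data):
        self.parts.append(data)


def text_with_alt(markup: str) -> str:
    parser = AltTextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(text_with_alt(
    '<p>Sales grew. <img src="q2.png" alt="Q2 revenue chart"> See above.</p>'
))
# Sales grew. [image: Q2 revenue chart] See above.
```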

Complex table layouts flatten poorly. HTML tables used for layout (the old-school <table>-based email templates, for example) often produce output that's difficult to read as plain text. Cells get concatenated in reading order, which may or may not match what a human would expect. Simple data tables do okay; complex layout tables become a mess.

CSS-hidden content is still included. Elements with display: none or visibility: hidden are still present in the HTML source, which means their text content appears in the plain text output. If a page has a hidden mobile menu, a hidden duplicate heading, or hidden tooltip text, all of that shows up. A parser can catch hiding that's declared in an inline style attribute, but rules that come from stylesheets or scripts can't be resolved reliably without running the full browser rendering pipeline.

JavaScript-rendered content won't be there. If you paste the raw HTML source of a page that loads its content dynamically via JavaScript, the converter only sees what's in the static HTML, which might be very little. For JS-rendered pages you'd need a headless browser to get the rendered output first.

No semantic interpretation. The converter doesn't know that <nav> is navigation that you might want to skip, or that <aside> is a sidebar you probably don't want in your extracted content. It treats all elements equally. You might want to pre-process the HTML (remove <nav>, <footer>, <aside>, <script>, and <style> blocks) before converting, to get cleaner output.
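That kind of pre-processing can be folded into the extraction step. A minimal sketch with Python's html.parser follows; the MainTextExtractor class and the SKIP_TAGS set are made up for this example, and the depth counter assumes the skipped elements are properly closed in the source.

```python
from html.parser import HTMLParser

# Elements whose entire content is dropped (an assumed list; adjust to taste).
SKIP_TAGS = {"nav", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def extract_main_text(markup: str) -> str:
    parser = MainTextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse all whitespace runs to single spaces for a compact result.
    return " ".join("".join(parser.parts).split())


page = ("<nav><a href='/'>Home</a></nav>"
        "<main><p>Actual content.</p><script>track();</script></main>"
        "<footer>Copyright</footer>")
print(extract_main_text(page))
# Actual content.
```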

For web scraping specifically, you're better served by something like BeautifulSoup (Python) or cheerio (Node.js) for production use, since they let you target specific elements with selectors before extracting text. The HTML to Text tool is best for quick, one-off conversions where those tradeoffs are acceptable.

The HTML to Text converter fits naturally alongside a few other tools:

  • HTML Encoder: The reverse of entity decoding. It encodes special characters back to HTML entities. Useful when you need to safely embed text inside HTML.
  • HTML to Markdown: When you need to preserve structure rather than strip it. Better for content migration.
  • HTML Minifier: When you want to keep the HTML but reduce its file size by removing whitespace, comments, and redundant attributes. Different goal: you're not extracting content, you're compressing markup.

A Quick Workflow Example

Let me walk through a realistic scenario: you've scraped a product description page and have the raw HTML of the <main> element. You want the plain text for a comparison spreadsheet.

  1. Copy the HTML from your scraper or browser DevTools.
  2. Paste it into /tools/html-to-text.
  3. The output appears with tags stripped, entities decoded, and block elements converted to line breaks.
  4. Copy the result into your spreadsheet.

Total time: about 15 seconds. Doing it with a regex in the browser console would take longer and almost certainly mangle the entities.

Conclusion

HTML to text conversion is one of those tasks that seems trivial until you actually try to do it, and then you realize there are three or four layers of complexity that need to be handled correctly to get usable output. Tags are the obvious part; entities and whitespace normalization are what trip people up.

The HTML to Text tool handles all of it in one step. For quick one-off extractions (from emails, scraped pages, CMS exports, or anywhere else you've got a blob of HTML), it's the fastest path from markup to plain text.

If you find yourself needing to preserve formatting rather than strip it, check out HTML to Markdown. And if you're working with HTML entities directly, the HTML Encoder and HTML Minifier round out the toolkit.
