How to get the pure text without HTML element using JavaScript?

by Nikita Barsukov · Oct 29, 2024

Fetch text stripped of HTML by leveraging textContent for full content or innerText for visible content:

let unstyledText = document.getElementById('yourElementId').textContent; // Style? I don't know her.
// or
let visibleText = document.getElementById('yourElementId').innerText; // No hide and seek here.

textContent is the faster choice for bulk extraction because it never needs layout information, whereas innerText is layout-aware and fits when only the rendered, visible text matters.

Explaining `textContent` and `innerText`

Even though both properties can help extract the text within nodes, they operate differently:

textContent is the blunt tool: it returns the text of every descendant node, including text inside hidden elements and inside <script> and <style> tags, and it ignores CSS entirely.
innerText, on the other hand, is the discerning butler: it returns only text that is actually rendered, so content hidden with CSS (for example display: none) is skipped. It approximates what a user would get by selecting the text on the page and copying it.
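
To make the difference concrete, here is a minimal sketch. The helper name compareTextProperties and the sample markup in the comment are made up for illustration, and the exact whitespace innerText returns depends on layout, so treat the described result as approximate.

```javascript
// Sample markup (hypothetical):
//   <p id="demo">Visible <span style="display: none">hidden</span></p>

// Hypothetical helper: reads both properties from the same element so they
// can be compared side by side.
function compareTextProperties(element) {
  return {
    allText: element.textContent,   // every text node, hidden or not
    visibleText: element.innerText, // only text that is actually rendered
  };
}

// Only touch the DOM when running in a browser.
if (typeof document !== 'undefined') {
  const demo = document.getElementById('demo');
  if (demo) {
    console.log(compareTextProperties(demo));
  }
}
```

For the sample markup, allText should contain both words, while visibleText should leave out the hidden span.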

Selection of target elements

Correctly identifying your target is key. Keep selectors precise, and don't hang an id on every element just to be able to reach it:

Utilize document.getElementById('yourElementId') to zero in on a specific element.
Use document.querySelector('selector') for complex CSS selectors to triangulate your target.
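
Both lookups return null when nothing matches, and reading .textContent from null throws a TypeError. A small guard avoids that. The helper name getTextBySelector and the second selector below are assumptions for illustration, not something the page is known to contain.

```javascript
// Hypothetical helper: returns the text of the first match, or null if the
// selector matches nothing. `root` defaults to the page's document.
function getTextBySelector(selector, root = document) {
  const element = root.querySelector(selector);
  return element ? element.textContent : null;
}

// Only touch the DOM when running in a browser.
if (typeof document !== 'undefined') {
  console.log(getTextBySelector('#yourElementId'));
  console.log(getTextBySelector('article .summary p:first-of-type'));
}
```

Passing a different root (any element or document with querySelector) limits the search to that subtree.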

HTML tags, begone!

When dealing with innerHTML that has nested HTML tags:

let htmlContent = document.getElementById('yourElementId').innerHTML;

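You can strip the tags by calling replace() with a regular expression. Treat this as a quick fix for simple, trusted markup: it leaves HTML entities such as &amp; encoded, and it can misfire when an attribute value contains a > character: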

let cleanText = htmlContent.replace(/<[^>]*>/g, ''); // We don't need your kind here.
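
The regex above removes tags but leaves entities untouched. As a rough sketch, the first helper below decodes a handful of common entities after stripping, and the second hands the job to the browser's HTML parser via DOMParser, which tends to be more robust. Both function names and the entity table are illustrative assumptions, and the table is far from complete.

```javascript
// A deliberately small entity table; real HTML defines many more.
const BASIC_ENTITIES = {
  '&amp;': '&',
  '&lt;': '<',
  '&gt;': '>',
  '&quot;': '"',
  '&#39;': "'",
  '&nbsp;': '\u00A0',
};

// Hypothetical helper: strips tags with the same regex, then decodes the
// entities listed above in a single pass (so '&amp;lt;' becomes '&lt;').
function stripTagsWithRegex(html) {
  return html
    .replace(/<[^>]*>/g, '')
    .replace(/&(?:amp|lt|gt|quot|#39|nbsp);/g, (entity) => BASIC_ENTITIES[entity]);
}

// Browser-only alternative: parse the string as HTML and read the text back.
function stripTagsWithParser(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return doc.body.textContent;
}

console.log(stripTagsWithRegex('<p>Fish &amp; <b>chips</b></p>')); // prints "Fish & chips"
```

If the markup already lives in the page, reading textContent from the element directly skips this step altogether.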

Implement event listener for text extraction

Attach an event listener to an element (such as a button) so that extraction runs on demand, for example when the user clicks:

document.getElementById('yourButtonId').addEventListener('click', extractText); // extractText is your own handler; it runs on every click.
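
The extractText handler is not defined in the snippet, so here is one possible shape for it. The factory name createExtractHandler and the choice to pass the result to a callback are assumptions made for this sketch; the element ids are the placeholders used throughout the article.

```javascript
// Hypothetical factory: builds a click handler that reads text from `source`
// and passes the trimmed result to `onText`.
function createExtractHandler(source, onText) {
  return function extractText() {
    onText(source.textContent.trim());
  };
}

// Only wire things up when running in a browser and both elements exist.
if (typeof document !== 'undefined') {
  const source = document.getElementById('yourElementId');
  const button = document.getElementById('yourButtonId');
  if (source && button) {
    button.addEventListener('click', createExtractHandler(source, console.log));
  }
}
```

Swapping console.log for your own function lets you display, store, or send the extracted text instead.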

Mastering child nodes

For deeply nested DOM trees, you can walk Node.childNodes recursively and collect the text nodes yourself:

function gatherAllText(node) {
  let result = '';
  node.childNodes.forEach(child => {
    // Text nodes contribute their value; any other node is searched recursively.
    result += child.nodeType === Node.TEXT_NODE ? child.nodeValue : gatherAllText(child);
  });
  return result;
}
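
A hand-written walk earns its keep when you want to filter along the way. The variant below is a sketch that skips <script> and <style> elements, whose contents a plain recursive walk would otherwise include. The name gatherReadableText and the list of skipped tags are assumptions; adjust them to your page.

```javascript
const TEXT_NODE = 3; // same numeric value as Node.TEXT_NODE in browsers
const SKIPPED_TAGS = new Set(['SCRIPT', 'STYLE']);

// Hypothetical variant of gatherAllText that ignores script and style content.
function gatherReadableText(node) {
  let result = '';
  for (const child of Array.from(node.childNodes || [])) {
    if (child.nodeType === TEXT_NODE) {
      result += child.nodeValue;
    } else if (!SKIPPED_TAGS.has(child.nodeName)) {
      // nodeName is upper-case for HTML elements, e.g. 'SCRIPT'.
      result += gatherReadableText(child);
    }
  }
  return result;
}

// Only touch the DOM when running in a browser.
if (typeof document !== 'undefined') {
  console.log(gatherReadableText(document.body));
}
```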

Simplification with jQuery

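If jQuery is included in your project, bingo! Text extraction becomes a cakewalk. Like textContent, .text() returns the combined text of the element and all its descendants:

let text = $('#yourElementId').text(); // jQuery has your back, buddy!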

Storing your spoils

Store the extracted text in a variable for future use or manipulation:

let extractedText = document.getElementById('yourElementId').textContent; // Here's your loot, buddy!

Cross-platform considerations

Take a moment to verify browser compatibility before settling on a property. Every current browser supports both, so this mostly matters for legacy targets: Firefox lacked innerText before version 45, and Internet Explorer 8 and earlier lacked textContent.
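
If you do have to support such legacy browsers, a feature check is one way to cope. The helper below is a minimal sketch with a made-up name, readText; it prefers textContent and falls back to innerText only when textContent is unavailable.

```javascript
// Hypothetical helper: picks whichever text property the element exposes.
function readText(element) {
  if (typeof element.textContent === 'string') {
    return element.textContent;
  }
  if (typeof element.innerText === 'string') {
    return element.innerText;
  }
  return ''; // neither property is available
}

// Only touch the DOM when running in a browser.
if (typeof document !== 'undefined') {
  const element = document.getElementById('yourElementId');
  if (element) {
    console.log(readText(element));
  }
}
```

Keep in mind that the two properties do not return identical strings, so the fallback trades exactness for availability.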

Balancing `innerText` and `textContent`

When life gives you two choices, consider your requirements:

Use innerText to keep readability intact; it emulates how a human would copy text from a webpage.
Choose textContent when raw data is the focus and visual formatting is irrelevant.

Parameters for tailored space handling

When extracting text, you might want to handle whitespace or newline characters:

function extractTextWithSpaceHandling(elementId, handleSpace) {
  let text = document.getElementById(elementId).textContent;
  // When handleSpace is true, collapse whitespace runs to single spaces and trim the ends.
  return handleSpace ? text.replace(/\s+/g, ' ').trim() : text; // Space, the final frontier.
}
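
Collapsing everything onto one line is not always what you want. As a sketch, the string-only helper below adds an option to keep line breaks while still tidying the spaces within each line. The name normalizeWhitespace and the keepLineBreaks parameter are assumptions introduced here.

```javascript
// Hypothetical helper: works on a plain string, so it can be fed textContent,
// innerText, or any other text.
function normalizeWhitespace(text, keepLineBreaks = false) {
  if (!keepLineBreaks) {
    // Collapse every whitespace run, newlines included, into one space.
    return text.replace(/\s+/g, ' ').trim();
  }
  return text
    .split(/\r?\n/)
    .map((line) => line.replace(/[^\S\r\n]+/g, ' ').trim()) // tidy spaces per line
    .filter((line) => line.length > 0)                      // drop empty lines
    .join('\n');
}

console.log(normalizeWhitespace('  a \n\n  b   c '));                      // prints "a b c"
console.log(JSON.stringify(normalizeWhitespace('  a \n\n  b   c ', true))); // prints "a\nb c"
```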