Regression: Versions >= v5.3.2 are unable to parse specific link #280

@stalgiag

I work on a project that validates its links using this library. One frequently validated link is the HTML spec at https://html.spec.whatwg.org/. This page is one of the larger HTML files on the web, but until release 5.3.2 node-html-parser parsed it successfully, taking approximately 23 seconds on my local machine.

Consider this example:

const HTMLParser = require('node-html-parser');
const nFetch = require('node-fetch');

async function parseHTMLSpec() {
  try {
    const response = await nFetch('https://html.spec.whatwg.org/');
    const html = await response.text();

    console.log('Fetched HTML. Attempting to parse...');
    console.time('parseHTMLSpec');
    const parsedHTML = HTMLParser.parse(html);
    console.timeEnd('parseHTMLSpec');

    console.log('HTML parsed successfully.');
    console.log('Title:', parsedHTML.querySelector('title').text);
  } catch (error) {
    console.error('Error occurred:', error);
  }
}

parseHTMLSpec();

With node-html-parser 5.3.1, this outputs the following:

Fetched HTML. Attempting to parse...
parseHTMLSpec: 23.415s
HTML parsed successfully.
Title: HTML Standard

With node-html-parser 5.3.2, this hangs indefinitely, printing only the following even after running for hours:

Fetched HTML. Attempting to parse...
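
As a stopgap for link validation, the parse could be given a hard time limit. Because `HTMLParser.parse` is called synchronously, a timer on the same thread would never fire while the parse is stuck, so the sketch below runs the work in a worker thread and terminates it on timeout. `runInWorker`, `parseTitleSource`, and `parseTitleWithTimeout` are hypothetical names and not part of node-html-parser. The sketch also assumes that `node-html-parser` can be resolved from the worker (for example, the script is run from the project root), and it does not address the underlying regression.

```javascript
const { Worker } = require('node:worker_threads');

// Hypothetical helper: run `source` in a worker thread and reject if it has
// not posted a result within `timeoutMs`. The worker is terminated either way.
function runInWorker(source, workerData, timeoutMs) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(source, { eval: true, workerData });
    let settled = false;

    // Settle the promise once, clear the timer, and stop the worker.
    const finish = (settle, value) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      worker.terminate();
      settle(value);
    };

    const timer = setTimeout(
      () => finish(reject, new Error(`Timed out after ${timeoutMs} ms`)),
      timeoutMs
    );

    worker.once('message', (value) => finish(resolve, value));
    worker.once('error', (err) => finish(reject, err));
    worker.once('exit', (code) =>
      finish(reject, new Error(`Worker exited with code ${code}`))
    );
  });
}

// Worker body (assumption: node-html-parser is resolvable from the worker).
// It parses the HTML passed as workerData and posts back the <title> text.
const parseTitleSource = `
  const { parentPort, workerData } = require('node:worker_threads');
  const { parse } = require('node-html-parser');
  const title = parse(workerData).querySelector('title');
  parentPort.postMessage(title ? title.text : null);
`;

// Resolves with the title text, or rejects if parsing exceeds `timeoutMs`.
function parseTitleWithTimeout(html, timeoutMs) {
  return runInWorker(parseTitleSource, html, timeoutMs);
}
```

In the example above, `HTMLParser.parse(html)` could be swapped for `await parseTitleWithTimeout(html, 60000)`, so that a hang on 5.3.2 surfaces as a rejected promise instead of a stalled process. The 60 second limit is an arbitrary choice, picked to sit comfortably above the roughly 23 seconds observed on 5.3.1.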
