Accessing DocumentNode.OuterHtml Causes Stack Overflow Exception On Demand #103

Open
blankers opened this issue Nov 14, 2017 · 12 comments

@blankers

I've encountered a strange situation with HTML source from http://portalamis.org.br/?secao=noticias
See the raw html in the attached file:
http-portalamis.org.br-secao-noticias.html.txt

Here's my code:
public HtmlAgilityPack.HtmlDocument document { get; private set; }
....
....
encoding = Encoding.UTF8;
this.document = new HtmlAgilityPack.HtmlDocument();
this.document.OptionFixNestedTags = true;
this.document.OptionAutoCloseOnEnd = true;
this.document.OptionDefaultStreamEncoding = encoding;
this.document.LoadHtml(htmlContent);

Then simply accessing
this.document.DocumentNode.OuterHtml
causes a stack overflow on demand.

@JonathanMagnan JonathanMagnan self-assigned this Nov 15, 2017
@JonathanMagnan
Member

Hello @blankers ,

Thank you for reporting.

We will look at this issue soon.

Best Regards,

Jonathan

@JonathanMagnan
Member

Hello @blankers ,

Just to let you know, we recently took some time to investigate this, but unfortunately we have not been able to find the cause.

We will try to investigate it again once my new developer is more comfortable with this library.

Best Regards,

Jonathan

@PhilipEve

Please find attached a project that also exhibits a stack overflow when run, on getting the OuterHtml of a node. The code in the project reads some HTML, modifies it a bit, then tries to access the OuterHtml of the document node. I have not taken the time to investigate whether the modifications are a necessary part of reproducing the problem.

When the relevant code is run in the context of an ASP.NET Core web site, different behaviour is observed. If the code is running under the debugger, the debugger closes with no user interaction. Setting a breakpoint at the line that accesses the OuterHtml getter and mousing over it causes a popup to appear, as seen in the screengrab. A search for the error code 0xc0000005 suggests that it means an access violation occurred.

access-violation

HtmlAgilityPackAccessViolation.zip

@PhilipEve

Further to the above - the failure is not seen (in either of its forms) if the line node.Attributes.RemoveAll(); is commented out.

@PhilipEve

Workaround

private static void RemoveAllAttributes(HtmlNode node)
{
    // We should be able to do this:
    //     node.Attributes.RemoveAll();
    // But there is a bug, see https://github.com/zzzprojects/html-agility-pack/issues/103

    var attributeNames = node.Attributes.Select(attr => attr.Name).ToArray();
    foreach (string attrName in attributeNames)
    {
        node.Attributes.Remove(attrName);
    }
}

@alexbk66

alexbk66 commented Jul 5, 2023

It's an old issue, but I also hit it. Debugging the code now, reproducing the Stack Overflow (call stack from bottom to top):

 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.OuterHtml.get() Line 660	C#
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlTextNode.Text.get() Line 67	C#
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteTo(System.IO.TextWriter outText, int level) Line 1984	C#
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteTo() Line 2145	C#
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.UpdateHtml() Line 2183	C#
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.OuterHtml.get() Line 663	C#
 	[The 5 frame(s) above this were repeated 1282 times]	
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteTo() Line 2145	C#
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.UpdateHtml() Line 2183	C#
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.OuterHtml.get() Line 663	C#
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlTextNode.Text.get() Line 67	C#
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteTo(System.IO.TextWriter outText, int level) Line 1984	C#
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteContentTo(System.IO.TextWriter outText, int level) Line 1881	C#
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteTo(System.IO.TextWriter outText, int level) Line 2034	C#
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteContentTo(System.IO.TextWriter outText, int level) Line 1881	C#
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteTo(System.IO.TextWriter outText, int level) Line 2034	C#
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteContentTo(System.IO.TextWriter outText, int level) Line 1881	C#
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteTo(System.IO.TextWriter outText, int level) Line 2034	C#
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteContentTo(System.IO.TextWriter outText, int level) Line 1881	C#
>	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteContentTo() Line 1892	C#
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.UpdateHtml() Line 2182	C#
 	HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.OuterHtml.get() Line 663	C#
 	HSPI_AKWeather.exe!HSPI_AKWeather.HtmlGeneratorWeather.WriteHtmlFile() Line 143	C#

@alexbk66

alexbk66 commented Jul 5, 2023

Further investigation (I don't quite understand the code yet, but...):
In class HtmlTextNode, the Text getter calls base.OuterHtml when _text == null, which leads to the infinite recursion seen in the call stack:

        /// <summary>
        /// Gets or Sets the text of the node.
        /// </summary>
        public string Text
        {
            get
            {
                if (_text == null)
                {
                    return base.OuterHtml;
                }

                return _text;
            }
            set
            {
                _text = value;
                SetChanged();
            }
        }
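
The cycle visible in the call stack above can be illustrated with a small toy model. This is only a sketch and not HtmlAgilityPack's actual source: ToyNode, ToyTextNode, Demo.Probe and the MaxDepth guard are hypothetical names introduced here so that the demo terminates instead of overflowing the stack. It assumes, as the stack trace suggests, that serializing a text node reads Text, and that Text falls back to OuterHtml when no text is stored.

```csharp
using System;

// Toy model only: these classes are NOT HtmlAgilityPack's source.
class ToyNode
{
    private int _depth;              // counts re-entrant OuterHtml calls
    public const int MaxDepth = 50;  // hypothetical guard so the demo terminates

    public virtual string OuterHtml
    {
        get
        {
            _depth++;
            try
            {
                if (_depth > MaxDepth)
                    throw new InvalidOperationException("cycle detected");
                return WriteTo();    // serialization is delegated to WriteTo
            }
            finally
            {
                _depth--;
            }
        }
    }

    protected virtual string WriteTo() => "<node/>";
}

class ToyTextNode : ToyNode
{
    private string _text;

    public string Text
    {
        // Same shape as the getter quoted above: fall back to OuterHtml.
        get { return _text ?? base.OuterHtml; }
        set { _text = value; }
    }

    // Writing a text node reads Text, which closes the cycle when _text is null.
    protected override string WriteTo() => Text;
}

static class Demo
{
    // Returns the serialized text, or the guard's message if the cycle is hit.
    public static string Probe(string text)
    {
        var node = new ToyTextNode { Text = text };
        try { return node.OuterHtml; }
        catch (InvalidOperationException e) { return e.Message; }
    }

    static void Main()
    {
        Console.WriteLine(Probe("hello")); // text is stored, no recursion
        Console.WriteLine(Probe(null));    // text is missing, guard trips
    }
}
```

In this model the recursion only starts when the text node has no stored text at the moment it is serialized. Whether that is the exact condition in the library itself is not confirmed in this thread.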

@alexbk66

alexbk66 commented Jul 7, 2023

More info. The problem happens if I

  • Call htmlDoc.LoadHtml(html)
  • save html once (call htmlDoc.DocumentNode.OuterHtml),
  • then call some SetAttributeValue() again,
  • then save html again (call htmlDoc.DocumentNode.OuterHtml)
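
The steps above can be sketched roughly as follows. This is a minimal outline, not a confirmed reproduction: the markup is a placeholder (the HTML that actually triggers the overflow is not included in this comment), the data-state attribute and the ReproSketch/SaveTwice names are hypothetical, and the element id label_14 is borrowed from later in this thread. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is referenced

static class ReproSketch
{
    // Follows the four steps listed above and returns the second serialization.
    public static string SaveTwice(string html)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);                               // step 1: load

        HtmlNode node = htmlDoc.GetElementbyId("label_14");
        if (node == null)
            throw new InvalidOperationException("element 'label_14' not found");

        node.SetAttributeValue("data-state", "first");        // hypothetical attribute
        _ = htmlDoc.DocumentNode.OuterHtml;                   // step 2: first save

        node.SetAttributeValue("data-state", "second");       // step 3: modify again
        return htmlDoc.DocumentNode.OuterHtml;                // step 4: second save
    }

    static void Main()
    {
        // Placeholder markup; substitute the document that shows the problem.
        const string html =
            "<html><body><label id=\"label_14\">text</label></body></html>";
        Console.WriteLine(SaveTwice(html));
    }
}
```

With the placeholder markup this is not expected to overflow; it only shows the order of calls to try against the real document.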

@JonathanMagnan
Member

Hello @alexbk66 ,

Do you think you could reproduce the issue in a Fiddle? I'm not sure whether it will get fixed, but we can certainly look at it.

Here is a working fiddle with your example: https://dotnetfiddle.net/ImPNc1

Best Regards,

Jon

@alexbk66

alexbk66 commented Jul 13, 2023

Hi Jon,

I copied my HTML to https://dotnetfiddle.net/LQ5nAB

It doesn't cause a stack overflow there, but it still fails because of the <style> tag:

System.NullReferenceException: Object reference not set to an instance of an object. at Program.Main() in d:\Windows\Temp\xnyzzw5v.0.cs:line 261

In Visual Studio the stack overflow also happens at this tag.

But if I add spaces around the tag, then it works.
I'll try adding spaces in my code to see if it works and report back later.

< style id = ""gwd-text-style"" >

@JonathanMagnan
Member

Hello @alexbk66 ,

It currently fails because the end tag is badly formatted, </ style > (with a space), so to HAP it looks as if no label exists.

Therefore the following line, var node = htmlDoc.GetElementbyId("label_14");, leaves node set to null, which throws the null reference exception on this line: var tn1 = (HtmlTextNode)node.SelectSingleNode("text()");

I will wait for your investigation to reproduce the stack overflow issue.

@alexbk66

The spaces were inserted by .NET Fiddle for some reason when I copied the code. You are right though: if I remove the spaces, it works.
