Memory Leak in html2text 2014.4.5 #13

Closed
OOPMan opened this issue Jun 19, 2014 · 5 comments
@OOPMan

OOPMan commented Jun 19, 2014

I recently added html2text 2014.4.5 to my project and have been using it to convert HTML generated from Jinja2 templates into text. I attach the HTML and the text version of said HTML to emails constructed using the standard email.mime classes.

I added html2text amidst some other changes, so it took me a little time to trace the memory leak that then started occurring back to html2text:

(Pdb) problem
Partition of a set of 2 objects. Total size = 38380808 bytes.
 Index  Count   %     Size   % Cumulative  % Referrers by Kind (class / dict of class)
     0      2 100 38380808 100  38380808 100 dict of html2text.HTML2Text
(Pdb) problem.byclodo
Partition of a set of 2 objects. Total size = 38380808 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      2 100 38380808 100  38380808 100 unicode
(Pdb) problem.byid
Set of 2 <unicode> objects. Total size = 38380808 bytes.
 Index     Size   %   Cumulative  %   Representation (limited)
     0 38380600 100.0  38380600 100.0 u'Hi Mar... \n\n \n'
     1      208   0.0  38380808 100.0 u'<p sty... 100%;">'
(Pdb) problem.byvia
Partition of a set of 2 objects. Total size = 38380808 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0      1  50 38380600 100  38380600 100 "['outtext']"
     1      1  50      208   0  38380808 100 "['_HTMLParser__starttag_text']"
(Pdb) leftover
Partition of a set of 419 objects. Total size = 38480944 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0     26   6 38384680 100  38384680 100 unicode
     1     93  22    39096   0  38423776 100 dict (no owner)
     2     43  10    23560   0  38447336 100 dict of guppy.etc.Glue.Interface
     3      8   2     8384   0  38455720 100 dict of guppy.etc.Glue.Share
     4     22   5     6160   0  38461880 100 dict of guppy.etc.Glue.Owner
     5    100  24     5232   0  38467112 100 str
     6     23   5     3128   0  38470240 100 list
     7     43  10     2752   0  38472992 100 guppy.etc.Glue.Interface
     8     22   5     1584   0  38474576 100 guppy.etc.Glue.Owner
     9      1   0     1048   0  38475624 100 dict of guppy.heapy.Classifiers.ByUnity
<17 more rows. Type e.g. '_.more' to view.>
(Pdb) 

The above information was captured using the heapy component of Guppy-PE, following the steps described in http://python.dzone.com/articles/diagnosing-memory-leaks-python and http://www.smira.ru/wp-content/uploads/2011/08/heapy.html

As you can see, the contents of ['outtext'] are huge and, based on inspection of the data itself (see the last file referenced below), consist of basically the same text repeated over and over. This would seem to indicate some kind of looping error.

I'm not sure if it is relevant to this issue but every now and then when using html2text it fails after reaching line 360 of /usr/lib64/python2.7/HTMLParser.py:
raise AssertionError("we should not get here!")

On a final note, I have replicated both of these issues using both Python 2.7.5 64-bit and PyPy 2.3.0 64-bit.

For your reference as to the context, please see the following pastebin links:

send_mail: http://pastebin.com/hHHh1fUN
email_tasks.py (used by send_mail): http://pastebin.com/Eqcsk23X
email_template: http://pastebin.com/XWS4VreU
base_email_template (used by email_template): http://pastebin.com/X7GfT1LJ
contents of ['outtext']: http://www.mediafire.com/view/6uoj861r59oxme9/problemcontents.txt

I have not done any investigation yet into the exact cause of this issue with html2text, although I hope to do so tomorrow.

For now, hopefully this information will prove useful in determining the source of the issue.

@Alir3z4
Owner

Alir3z4 commented Jun 20, 2014

I think this can be related to this too: aaronsw/html2text#78

Thank you so much for reporting this with such details and information, especially heapy details.

Also, is this happening on Python 3?

@OOPMan
Author

OOPMan commented Jun 20, 2014

I'm afraid I haven't tested with Python 3 :-(

Some extra information I forgot to note:

  • A modified version of send_mail was used with Heapy. The modified version wrapped a straight call (no Celery or threading) to send_email with a Heapy before and after.
  • The issue also occurs when using threading or Celery (no surprises there).
  • The issue seems to be somewhat variable. Sometimes a number of html2text conversions will proceed with no issue; other times it will get stuck churning after one or two conversions.

@Alir3z4
Owner

Alir3z4 commented Jun 21, 2014

@mcepl since you've been into aaronsw/html2text#78, any comment?

@mcepl
Contributor

mcepl commented Jun 22, 2014

OK, we have this in the __init__ of the HTML2Text class, but we don't dispose of self.outtext whenever we run .handle:

    try:
        self.outtext = unicode()
    except NameError:  # Python3
        self.outtext = str()

Perhaps throwing away html_to_text on each run of your cycle and constructing a new object for each HTML document processed would help? But that's probably horribly slow. I will take a look at whether we should throw away the self.outtext content on each handle call.
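
To make the suspected mechanism concrete, here is a deliberately simplified sketch. LeakyConverter is a hypothetical stand-in, not the real HTML2Text class, and the real accumulation is probably more involved than a plain append; the only thing borrowed from html2text is the idea of an outtext buffer created in __init__ and never cleared. The convert_all helper (also hypothetical) shows the workaround of building a new object for every document.

```python
class LeakyConverter(object):
    """Hypothetical stand-in: the output buffer lives on the instance."""

    def __init__(self):
        # Created once per object and never cleared afterwards.
        self.outtext = ""

    def handle(self, data):
        # Each call appends to whatever earlier calls left behind.
        self.outtext += data.strip() + "\n\n"
        return self.outtext


def convert_all(documents):
    """Workaround: use a fresh converter for every document."""
    return [LeakyConverter().handle(doc) for doc in documents]


shared = LeakyConverter()
first = shared.handle("miow ")   # only this call's text
second = shared.handle("miow ")  # also contains the first call's text

fresh = convert_all(["miow ", "miow "])  # no carry-over between documents
```

With a shared instance the buffer grows on every call, whereas the per-document instances each start from an empty buffer.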

@Alir3z4
Copy link
Owner

Alir3z4 commented Jul 2, 2014

@mcepl @OOPMan
We're going to have lots of cats:

In [18]: h2t = HTML2Text()

In [23]: h2t.handle('miow ')
Out[23]: u'miow\n\n'

In [24]: h2t.handle('miow ')
Out[24]: u'miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\n'

In [25]: h2t.handle('miow ')
Out[25]: u'miow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\n'

In [26]: h2t.handle('') #Even empty string :|
Out[26]: u'miow miow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow 
miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow 
miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow miow\n\nmiow\n\nmiow\n\nmiow miow\n\nmiow\n\n'

Probably self.outtext should be murdered.
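
One possible shape for that, sketched on a toy class rather than on html2text itself: move the per-call state into a reset helper and call it at the top of handle. ResettingConverter and _reset are hypothetical names, and this is not necessarily how the library ends up fixing it.

```python
class ResettingConverter(object):
    """Hypothetical toy converter that clears its buffer on every call."""

    def __init__(self):
        self._reset()

    def _reset(self):
        # All state that must not survive from one handle() call to the next.
        self.outtext = ""

    def handle(self, data):
        self._reset()  # discard output left over from any previous call
        self.outtext += data.strip() + "\n\n"
        return self.outtext


converter = ResettingConverter()
outputs = [converter.handle("miow ") for _ in range(3)]
empty_output = converter.handle("")
```

Every call on the reused object now returns only its own output, so a long-lived instance no longer grows with the number of documents it has processed.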
