Goose was originally an article extractor written in Java that was most recently (August 2011) converted to a Scala project.
This is a complete rewrite in Python. The aim of the software is to take any news article or article-type web page and extract not only the main body of the article but also all metadata and the most probable image candidate.
Goose will try to extract the following information:
- Main text of an article
- Main image of article
- Any Youtube/Vimeo movies embedded in article
- Meta Description
- Meta tags
The Python version was written by:
- Xavier Grangier
If you find Goose useful or have issues, please drop me a line. I'd love to hear how you're using it or what features should be improved.
Goose is licensed under the Apache 2.0 license; see the LICENSE file for more details.
mkvirtualenv --no-site-packages goose
git clone
cd python-goose
pip install -r requirements.txt
python setup.py install
>>> from goose import Goose
>>> url = ''
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Occupy London loses eviction fight'
>>> article.meta_description
"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
>>> article.cleaned_text[:150]
(CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi
>>> article.top_image.src
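The attributes shown above can be gathered into a plain dictionary, which is convenient for logging or serialization. What follows is a minimal sketch rather than part of Goose: `article_summary` is a hypothetical helper, and it assumes that `top_image` may be missing or `None` when no image candidate is found.

```python
def article_summary(article, text_length=150):
    """Collect the fields shown above into a plain dict.

    `article` is expected to be the object returned by Goose().extract().
    This helper is hypothetical and not part of Goose.
    """
    # Assumption: top_image can be absent or None when no image was found.
    top_image = getattr(article, 'top_image', None)
    return {
        'title': article.title,
        'meta_description': article.meta_description,
        'excerpt': article.cleaned_text[:text_length],
        'top_image_src': top_image.src if top_image is not None else None,
    }
```

Calling `article_summary(article)` on an extracted article returns a dict with the keys `title`, `meta_description`, `excerpt` and `top_image_src`.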
There are two ways to pass configuration to Goose. The first is to pass Goose a Configuration() object. The second is to pass a configuration dict.
For instance, if you want to change the user agent used by Goose, just pass:
>>> g = Goose({'browser_user_agent': 'Mozilla'})
Switching parser: Goose can now be used with the lxml html parser or the lxml soup parser. By default the html parser is used. If you want to use the soup parser, pass it in the configuration dict:
>>> g = Goose({'browser_user_agent': 'Mozilla', 'parser_class':'soup'})
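When several options are set, it can be clearer to build the configuration dict in one place. The sketch below uses only the two keys shown above (`browser_user_agent` and `parser_class`); `make_config` is a hypothetical helper, not a Goose function.

```python
def make_config(user_agent=None, use_soup_parser=False):
    """Build a configuration dict for Goose(...) from the options above.

    Hypothetical helper; only keys shown in this README are used.
    """
    config = {}
    if user_agent is not None:
        config['browser_user_agent'] = user_agent
    if use_soup_parser:
        # 'soup' selects the lxml soup parser; leaving the key out
        # keeps the default html parser.
        config['parser_class'] = 'soup'
    return config


config = make_config(user_agent='Mozilla', use_soup_parser=True)
```

The resulting dict can then be passed to Goose in the same way as the literal dict above, for example `Goose(config)`.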
For example, scraping a Spanish content page with correct meta language tags:
>>> from goose import Goose
>>> url = ''
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Las listas de espera se agravan'
>>> article.cleaned_text[:150]
u'Los recortes pasan factura a los pacientes. De diciembre de 2010 a junio de 2012 las listas de espera para operarse aumentaron un 125%. Hay m\xe1s ciudad'
Some pages don't have correct meta language tags. You can force the language using the configuration:
>>> from goose import Goose
>>> url = ''
>>> g = Goose({'use_meta_language': False, 'target_language':'es'})
>>> article = g.extract(url=url)
>>> article.cleaned_text[:150]
u'Importante golpe a la banda terrorista ETA en Francia. La Guardia Civil ha detenido en un hotel de Macon, a 70 kil\xf3metros de Lyon, a Izaskun Lesaka y '
Passing {'use_meta_language': False, 'target_language':'es'} as configuration will force the Spanish language.
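If the language has to be forced for several sites, the two keys can be added to an existing configuration dict by a small function. This is a sketch under the assumption that the other keys in the dict should be left untouched; `forced_language_config` is a hypothetical name and not part of Goose.

```python
def forced_language_config(language_code, base_config=None):
    """Return a copy of base_config that forces the given language.

    Hypothetical helper; it only sets the two keys described above.
    """
    config = dict(base_config or {})
    # Ignore the page's meta language tags and use the given language.
    config['use_meta_language'] = False
    config['target_language'] = language_code
    return config


spanish_config = forced_language_config('es')
```

The original dict is copied rather than modified, so one base configuration can be reused for several languages.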
>>> import goose
>>> url = ''
>>> g = goose.Goose({'target_language':'fr'})
>>> article = g.extract(url=url)
>>> article.movies
[<goose.videos.videos.Video object at 0x25f60d0>]
>>> article.movies[0].src
>>> article.movies[0].embed_code
'<iframe src="" frameborder="0" scrolling="no" width="476" height="357"/>'
>>> article.movies[0].embed_type
>>> article.movies[0].width
>>> article.movies[0].height
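An article may embed more than one movie, so it can be useful to loop over `article.movies` rather than index the first entry. The following sketch collects the attributes used above into plain dicts. `movies_to_dicts` is a hypothetical helper, and it falls back to `None` for any attribute an object does not provide.

```python
MOVIE_FIELDS = ('src', 'embed_code', 'embed_type', 'width', 'height')


def movies_to_dicts(movies):
    """Convert a list of movie objects (e.g. article.movies) to dicts.

    Hypothetical helper; only the attributes shown above are read.
    """
    result = []
    for movie in movies:
        # getattr with a default keeps the loop going if a field is absent.
        result.append(dict((name, getattr(movie, name, None))
                           for name in MOVIE_FIELDS))
    return result
```

With an extracted article, `movies_to_dicts(article.movies)` yields one dict per embedded movie.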
If you have encoding problems (this happens with Russian), use `article = Goose({'use_meta_language': False, 'target_language':'ru'}).extract(url=inurl)`
Also, you should
every sentence from SummarizeUrl(url), to avoid problems with unicode and russian language
Some users want to use Goose for Chinese content. Chinese word segmentation is much more difficult to deal with than that of Western languages. Chinese needs a dedicated stop word analyser that needs to be passed to the config object:
>>> from goose import Goose
>>> from goose.text import StopWordsChinese
>>> url = ''
>>> g = Goose({'stopwords_class': StopWordsChinese})
>>> article = g.extract(url=url)
>>> print article.cleaned_text[:150]
In order to use Goose in Arabic you have to use the StopWordsArabic class.
>>> from goose import Goose
>>> from goose.text import StopWordsArabic
>>> url = ''
>>> g = Goose({'stopwords_class': StopWordsArabic})
>>> article = g.extract(url=url)
>>> print article.cleaned_text[:150]
دمشق، سوريا (CNN) -- أكدت جهات سورية معارضة أن فصائل مسلحة معارضة لنظام الرئيس بشار الأسد وعلى صلة بـ"الجيش الحر" تمكنت من السيطرة على مستودعات للأسل
There are some issues with Unicode URLs.
Cookie handling: Some websites need cookie handling. At the moment the only workaround is to use the raw_html extraction. For instance:
>>> import urllib2
>>> import goose
>>> url = ""
>>> opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
>>> response =
>>> raw_html =
>>> g = goose.Goose()
>>> a = g.extract(raw_html=raw_html)
>>> a.cleaned_text
u'CAIRO \u2014 For a moment, at least, American and European diplomats trying to defuse the volatile standoff in Egypt thought they had a breakthrough.\n\nAs t'
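The fetching step of this workaround can be wrapped in a small function. The sketch below is one possible way to do it, not the author's exact code: it assumes the page is fetched with `opener.open(url)` and read with `response.read()`, and `fetch_raw_html` is a hypothetical name. The import falls back to `urllib.request`, which provides the same two calls on Python 3.

```python
try:
    import urllib2  # Python 2, as used in the example above
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalent


def fetch_raw_html(url, opener=None):
    """Fetch a page through a cookie-aware opener and return its raw body.

    Hypothetical helper. Assumption: opening the URL and reading the
    response is all that is needed to get HTML usable by Goose.
    """
    if opener is None:
        # HTTPCookieProcessor stores cookies set by the site and sends
        # them back on redirects and later requests made by this opener.
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
    response = opener.open(url)
    try:
        return response.read()
    finally:
        response.close()
```

The returned HTML can then be passed to Goose with `g.extract(raw_html=raw_html)`, as in the example above.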
- Video html5 tag extraction