Dissecting the News
01.09.14 · python
Instead of writing and tuning scrapers for individual news sites, Newspaper provides an API to aggregate and parse arbitrary sources. Point Newspaper at a site and it will automatically download a full list of articles. You can then parse them individually to identify keywords, titles, authors, summaries, and more.
Even though Newspaper is new, it’s already well documented and turning heads. Block out some time this weekend and give it a try!
```python
## Establish a 'source' and pull basic data (brand, description, articles, etc)
import newspaper
engadget = newspaper.build("http://engadget.com")

## Get number of articles & list their URLs
print(engadget.size())  # I ran this script at 9pm EST 1/8/2014 and got over 500 results due to CES coverage
for article in engadget.articles:
    print(article.url)
    # http://www.engadget.com/2014/01/07/ces-stage-kickoff/
    # http://www.engadget.com/2014/01/08/best-of-ces-2014-finalists/
    # etc.

## Get source description & brand
print(engadget.brand)  # engadget
print(engadget.description)  # Engadget is a web magazine with obsessive daily coverage of everything new in gadgets and consumer electronics

## Download & parse the first article
a1 = engadget.articles[0]
a1.download()
a1.parse()
print(a1.text)  # full text of article
print(a1.authors)  # [u'Brian Heater']
print(a1.title)  # Live from the Engadget CES Stage

## Use natural language processing to gain more insight
a1.nlp()
print(a1.summary)  # YMMV with this function.
print(a1.keywords)  # [u'week', u'christen', u'whats', u'working', u'interviews', u'engadget', u'live', u'doesnt', u'ces', u'open', u'event', u'stage']
```
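When a source has hundreds of articles, any single download may fail, so it can help to wrap the download-and-parse step in a small helper. The sketch below is one way to do that and is not part of Newspaper itself: `parse_first` and its `limit` parameter are hypothetical names, and the code assumes only that each article exposes the `download()`, `parse()`, `url`, `title`, and `authors` members used above.

```python
def parse_first(articles, limit=5):
    """Download and parse up to `limit` articles, skipping any that fail.

    Hypothetical helper, not part of Newspaper: it relies only on the
    download(), parse(), url, title, and authors members shown above.
    """
    parsed = []
    for article in list(articles)[:limit]:
        try:
            article.download()
            article.parse()
        except Exception:
            # A single bad page should not stop the whole run.
            continue
        parsed.append({
            "url": article.url,
            "title": article.title,
            "authors": list(article.authors),
        })
    return parsed
```

With the `engadget` source built above, `parse_first(engadget.articles, limit=3)` would return a list of up to three dictionaries, one for each article that downloaded and parsed cleanly.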