"Mr Branding" is a blog based on RSS for everything related to website branding and website design, it collects its posts from many sites in order to facilitate the updating to the latest technology.
To suggest a source, please contact me: Taha.baba@consultant.com
Tuesday, April 4, 2017
Scraping Webpages in Python With Beautiful Soup: The Basics
In a previous tutorial, I showed you how to use the Requests module to access webpages using Python. The tutorial covered a lot of topics, like making GET/POST requests and downloading things like images or PDFs programmatically. The one thing missing from that tutorial was a guide on scraping the webpages you accessed using Requests to extract the information you need.
In this tutorial, you will learn about Beautiful Soup, a Python library for extracting data from HTML and XML documents. The focus in this tutorial will be on the basics of the library; more advanced topics will be covered in the next tutorial. Please note that this tutorial uses Beautiful Soup 4 for all the examples.
Installation
You can install Beautiful Soup 4 using pip. The package name is beautifulsoup4. It should work on both Python 2 and Python 3.
$ pip install beautifulsoup4
If you don’t have pip installed on your system, you can directly download the Beautiful Soup 4 source tarball and install it using setup.py.
$ python setup.py install
Beautiful Soup is originally packaged as Python 2 code. When you install it for use with Python 3, it is automatically converted to Python 3 code. The code won’t be converted unless you install the package. Here are a few common errors that you might notice:
- The “No module named HTMLParser” ImportError occurs when you are running the Python 2 version of the code under Python 3.
- The “No module named html.parser” ImportError occurs when you are running the Python 3 version of the code under Python 2.
Both of the errors above can be corrected by uninstalling and reinstalling Beautiful Soup.
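For example:
$ pip uninstall beautifulsoup4
$ pip install beautifulsoup4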
Installing a Parser
Before discussing the differences between different parsers that you can use with Beautiful Soup, let’s write the code to create a soup.
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "html.parser")
The BeautifulSoup object can accept two arguments. The first argument is the actual markup, and the second argument is the parser that you want to use. The different parsers are html.parser, lxml, and html5lib. The lxml parser has two versions: an HTML parser and an XML parser.
The html.parser is a built-in parser, and it does not work so well in older versions of Python. You can install the other parsers using the following commands:
$ pip install lxml
$ pip install html5lib
The lxml parser is very fast and can be used to quickly parse given HTML. On the other hand, the html5lib parser is very slow, but it is also extremely lenient. Here is an example of using each of these parsers:
soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "html.parser")
print(soup)
# <html><p>This is <b>invalid HTML</b></p></html>

soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "lxml")
print(soup)
# <html><body><p>This is <b>invalid HTML</b></p></body></html>

soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "xml")
print(soup)
# <?xml version="1.0" encoding="utf-8"?>
# <html><p>This is <b>invalid HTML</b></p></html>

soup = BeautifulSoup("<html><p>This is <b>invalid HTML</p></html>", "html5lib")
print(soup)
# <html><head></head><body><p>This is <b>invalid HTML</b></p></body></html>
The differences outlined in the above example only matter when you are parsing invalid HTML. However, most of the HTML on the web is malformed in some way, so knowing these differences will help you debug parsing errors and decide which parser you want to use in a project. Generally, the lxml parser is a very good choice.
Objects in Beautiful Soup
Beautiful Soup parses the given HTML document into a tree of Python objects. There are four main Python objects that you need to know about: Tag, NavigableString, BeautifulSoup, and Comment.
The Tag object refers to an actual XML or HTML tag in the document. You can access the name of a tag using tag.name. You can also set a tag’s name to something else; the name change will be visible in the markup generated by Beautiful Soup.
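For example, here is a minimal sketch using a small made-up snippet of markup:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Hello</b>', "html.parser")
tag = soup.b
print(tag.name)      # 'b'
tag.name = "strong"  # rename the tag
print(soup)          # <strong class="boldest">Hello</strong>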
You can access different attributes, like the class and id of a tag, using tag['class'] and tag['id'] respectively. You can also access the whole dictionary of attributes using tag.attrs, and you can add, remove, or modify a tag’s attributes. Attributes like an element’s class, which can take multiple values, are stored as a list.
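Attribute access works like dictionary access, as this sketch with made-up markup shows:

tag = BeautifulSoup('<p class="lead intro" id="first">Hi</p>', "html.parser").p
print(tag['id'])       # 'first'
print(tag['class'])    # ['lead', 'intro'] - multi-valued attributes come back as a list
print(tag.attrs)       # {'class': ['lead', 'intro'], 'id': 'first'}
tag['id'] = 'opening'  # modify an attribute
del tag['class']       # remove an attribute
print(tag)             # <p id="opening">Hi</p>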
The text within a tag is stored as a NavigableString in Beautiful Soup. It has a few useful methods, like replace_with("string"), to replace the text within a tag. You can also convert a NavigableString to a Unicode string using unicode() (on Python 3, use str() instead).
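A minimal sketch:

soup = BeautifulSoup("<p>Hello world</p>", "html.parser")
text = soup.p.string          # a NavigableString
print(type(text).__name__)    # 'NavigableString'
text.replace_with("Goodbye")  # swap the text inside the tag
print(soup.p)                 # <p>Goodbye</p>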
Beautiful Soup also allows you to access the comments in a webpage. These comments are stored as a Comment object, which is basically a special type of NavigableString.
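A minimal sketch (note the extra Comment import):

from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<p><!-- a hidden note --></p>", "html.parser")
comment = soup.p.string
print(isinstance(comment, Comment))  # True
print(comment)                       # ' a hidden note '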
You have already learned about the BeautifulSoup object in the previous section. It is used to represent the document as a whole. Since it does not correspond to an actual HTML or XML tag, it has no name and no attributes.
Getting the Title, Headings, and Links
You can extract the page title and other such data very easily using Beautiful Soup. Let’s scrape the Wikipedia page about Python. First, you will have to get the markup of the page using the following code, which is based on the Requests module tutorial on accessing webpages.
import requests
from bs4 import BeautifulSoup

req = requests.get('http://ift.tt/RSMpTo')
soup = BeautifulSoup(req.text, "lxml")
Now that you have created the soup, you can get the title of the webpage using the following code:
soup.title
# <title>Python (programming language) - Wikipedia</title>

soup.title.name
# 'title'

soup.title.string
# 'Python (programming language) - Wikipedia'
You can also scrape the webpage for other information, like the main heading or the first paragraph, their classes, or the id attribute.
soup.h1
# <h1 class="firstHeading" id="firstHeading" lang="en">Python (programming language)</h1>

soup.h1.string
# 'Python (programming language)'

soup.h1['class']
# ['firstHeading']

soup.h1['id']
# 'firstHeading'

soup.h1.attrs
# {'class': ['firstHeading'], 'id': 'firstHeading', 'lang': 'en'}

soup.h1['class'] = 'firstHeading, mainHeading'
soup.h1.string.replace_with("Python - Programming Language")
del soup.h1['lang']
del soup.h1['id']

soup.h1
# <h1 class="firstHeading, mainHeading">Python - Programming Language</h1>
Similarly, you can iterate through all the links or subheadings in a document using the following code:
for sub_heading in soup.find_all('h2'):
    print(sub_heading.text)
# all the sub-headings like Contents, History[edit]...
Navigating the DOM
You can navigate through the DOM tree using regular tag names. Chaining those tag names can help you navigate the tree more deeply. For example, you can get the first link in the first paragraph of the given Wikipedia page by using soup.p.a. All the links in the first paragraph can be accessed by using soup.p.find_all('a').
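For example, with the soup created from the Wikipedia page earlier (the exact output depends on the live page):

first_link = soup.p.a             # the first <a> inside the first paragraph
print(first_link['href'])         # its href attribute
paragraph_links = soup.p.find_all('a')
print(len(paragraph_links))       # how many links the first paragraph contains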
You can also access all the children of a tag as a list by using tag.contents. To get a child at a specific index, you can use tag.contents[index]. You can also iterate over a tag’s children by using the .children attribute.
Both .children and .contents are useful only when you want to access the direct or first-level descendants of a tag. To get all the descendants, you can use the .descendants attribute (see the sketch after the next code example).
print(soup.p.contents)
# [<b>Python</b>, ' is a widely used ',.....the full list]

print(soup.p.contents[10])
# <a href="/wiki/Readability" title="Readability">readability</a>

for child in soup.p.children:
    print(child.name)
# b
# None
# a
# None
# a
# None
# ... and so on.
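To see how .descendants differs from .children, compare the two directly. This is a quick sketch using the same soup; the exact counts depend on the live page:

children = list(soup.p.children)         # direct children only
descendants = list(soup.p.descendants)   # children, their children, and so on
print(len(children), len(descendants))   # the descendants list is the longer one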
You can also access the parent of an element using the .parent attribute. Similarly, you can access all the ancestors of an element using the .parents attribute. The parent of the top-level <html> tag is the BeautifulSoup object itself, and its parent is None.
print(soup.p.parent.name)
# div

for parent in soup.p.parents:
    print(parent.name)
# div
# div
# div
# body
# html
# [document]
You can access the previous and next siblings of an element using the .previous_sibling and .next_sibling attributes.
For two elements to be siblings, they must have the same parent. This means that the first child of an element will not have a previous sibling, and the last child will not have a next sibling. In actual webpages, the previous and next siblings of an element will most probably be a newline character.
You can also iterate over all the siblings of an element using .previous_siblings and .next_siblings.
soup.head.next_sibling
# '\n'

soup.p.a.next_sibling
# ' for '

soup.p.a.previous_sibling
# ' is a widely used '

print(soup.p.b.previous_sibling)
# None
You can go to the element that comes immediately after the current element using the .next_element attribute. To access the element that comes immediately before the current element, use the .previous_element attribute.
Similarly, you can iterate over all the elements that come before and after the current element using .previous_elements and .next_elements respectively.
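The difference between siblings and elements is easiest to see on a tiny made-up document:

from bs4 import BeautifulSoup

doc = BeautifulSoup("<p><b>One</b>Two</p>", "html.parser")
b_tag = doc.b
print(b_tag.next_sibling)           # 'Two' - the next node at the same level
print(b_tag.next_element)           # 'One' - the next node in parse order, inside <b>
print(b_tag.previous_element.name)  # 'p' - the element parsed just before <b>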
Final Thoughts
After completing this tutorial, you should now have a good understanding of the main differences between different HTML parsers. You should now also be able to navigate through a webpage and extract important data. This can be helpful when you want to analyze all the headings or links on a given website.
In the next part of the series, you will learn how to use the Beautiful Soup library to search and modify the DOM.
by Monty Shokeen via Envato Tuts+ Code
4 Tips for Successful Social Media Contests
Do you run social media contests for your business? Looking for ways to make your contests deliver more than entries? A strong social media contest can generate real value for your business. In this article, you’ll discover four tips for executing a successful social media contest. #1: Appeal to Prospective Customers, Not Just Entrants The [...]
This post 4 Tips for Successful Social Media Contests first appeared on Social Media Examiner - Your Guide to the Social Media Jungle.
by James Scherer via Social Media Examiner
Monday, April 3, 2017
Social Media Blogs to Follow For Instant Online Stardom
[ This is a content summary only. Visit our website http://ift.tt/1b4YgHQ for full links, other content, and more! ]
by Guest Author via Digital Information World
Web Design Weekly #274
Headlines
The Unnecessary Fragmentation of Design Jobs
Jonas Downey, a designer at Basecamp, shares his thoughts on all the fragmentation around design job titles and gives some suggestions on how to approach it. (m.signalvnoise.com)
CSS is Not Broken (keithjgrant.com)
Peace of mind for freelancers
A great tool to provide better insight and bring calmness to the everyday rollercoaster ride that is freelancing. (cushionapp.com)
Articles
Plainness and Sweetness
Frank Chimero, one of the most talented and thoughtful designers in our industry, drops some words of wisdom. (frankchimero.com)
8 CSS gotchas to start your morning off right
Isaac Lyman shares a list of some of the biggest surprises you’re likely to face as a CSS newbie and some advice for navigating them. (medium.com)
Preload, Prefetch And Priorities in Chrome
Addy Osmani dives into insights from Chrome's networking stack to provide clarity on how web loading primitives (like <link rel="preload"> & <link rel="prefetch">) work behind the scenes so you can be more effective with them. (medium.com)
Building a CSS Grid Overlay
Andreas Larsen looks into what it takes to build a grid overlay with CSS. The example is responsive, easily customisable and uses CSS variables. (css-tricks.com)
Tools / Resources
Build and deploy websites (even with custom domains) on CodePen!
With CodePen Projects, not only can you build a website with all the files you need, right in your browser, but you can also deploy that website. By default you’ll get a CodePen URL that shows your complete site with zero additional UI, but you can also point a custom domain at your Project! (codepen.io)
Designing responsive web layouts in React Studio
A tutorial that walks you through some of the most essential layout features in React Studio, a new web app design tool. (hackernoon.com)
Glimmer
Glimmer is one of the fastest DOM rendering engines, delivering exceptional performance for initial renders as well as updates. (glimmerjs.com)
Compositor
Beautiful, Fast, & Simple GitHub Project Pages. (compositor.io)
Polished
A lightweight toolset for writing styles in JavaScript. (polished.js.org)
Huge Safari 10.1 Update (developer.apple.com)
Inspiration
An awesome freelance copywriter’s homepage (getcoleman.com)
Web Security with April King and Alex Sexton (shoptalkshow.com)
Runkeeper: A Usability Case Study (blog.prototypr.io)
Jobs
Product Designer at WeWork
We’re all passionate, energetic and hungry to give our members every advantage and to contribute to their success. We are rapidly expanding our product line to meet member needs and you will work with all stakeholders to understand our members, their existing pain points, and will provide world class interfaces that solve those pain points. (wework.com)
Web Designer at GitHub
GitHub is looking for a designer to join our Web Design team. As a Web Designer, you’d work closely with other designers, product managers, engineers, and our marketing team to translate ideas into delightful and informative visual designs. (github.com)
Need to find passionate developers or designers? Why not advertise in the next newsletter?
Last but not least…
8 Tips To Become A Better Front End Developer (hackernoon.com)
The post Web Design Weekly #274 appeared first on Web Design Weekly.
by Jake Bresnehan via Web Design Weekly