Mr Branding

Monday, October 3, 2016

g9.js – Automatically Interactive Graphics for the Web

g9 is a javascript library for creating automatically interactive graphics. With g9, interactive visualization is as easy as visualization that isn't. Just write a function which draws shapes based on data, and g9 will automatically figure out how to manipulate that data when you drag the shapes around.

by via jQuery-Plugins.net RSS Feed

How to Use Python to Find the Zipf Distribution of a Text File

You might be wondering about the term Zipf distribution. To understand what we mean by this term, we need to define Zipf's law first. Don't worry, I'll keep everything simple.

Zipf's Law

Zipf's law states that, given some corpus (a large and structured set of texts) of natural language utterances, the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, four times as often as the fourth most frequent word, and so forth.

Let's look at an example of that. If you look into the Brown Corpus of American English, you will notice that the most frequent word is the (69,971 occurrences). If we look into the second most frequent word, that is of, we will notice that it occurs 36,411 times.

The word the accounts for around 7% of the Brown Corpus words (69,971 of slightly over 1 million words). If we come to the word of, we will notice that it accounts for around 3.6% of the corpus (around half of the). Thus, we can notice that Zipf's law applies to this situation.
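As a quick check of the arithmetic above, the following minimal sketch computes the ratio between the two Brown Corpus counts quoted in the text. The variable names are illustrative only; the counts are the ones given above.

```python
# Counts quoted above for the Brown Corpus.
the_count = 69971   # occurrences of "the" (rank 1)
of_count = 36411    # occurrences of "of" (rank 2)

# Zipf's law predicts that the rank-1 word occurs roughly twice
# as often as the rank-2 word, so this ratio should be close to 2.
ratio = the_count / float(of_count)
print(round(ratio, 2))
```

The ratio comes out slightly below 2, which is consistent with the "approximately twice as often" wording of the law.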

Thus, Zipf's law is trying to tell us that a small number of items usually account for the bulk of activities we observe. For instance, a small number of diseases (cancer, cardiovascular diseases) account for the bulk of deaths. This also applies to words that account for the bulk of all word occurrences in literature, and many other examples in our lives.

Data Preparation

Before moving forward, let me refer you to the data we will be using to experiment with in our tutorial. Our data this time will be from the National Library of Medicine. We will be downloading what's called a MeSH (Medical Subject Heading) ASCII file, from here. In particular, d2016.bin (28 MB).

I will not go into detail in describing this file since it is beyond the scope of this tutorial, and we just need it to experiment with our code.

Building the Program

After you have downloaded the data in the above section, let's now start building our Python script that will find the Zipf's distribution of the data in d2016.bin.

The first normal step to perform is to open the file:

open_file = open('d2016.bin', 'r')

In order to carry out the necessary operations on the bin file, we need to load the file in a string variable. This can be simply achieved using the read() function, as follows:

file_to_string = open_file.read()
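The two steps above leave the file open for the rest of the script. One possible alternative, shown here only as a sketch, wraps them in a with block so the file is closed automatically. The helper name read_text is hypothetical and not part of the original program.

```python
def read_text(path):
    """Return the whole content of the file at `path` as one string."""
    # The with block closes the file as soon as the read finishes,
    # even if an error is raised while reading.
    with open(path, 'r') as handle:
        return handle.read()
```

It would be called as file_to_string = read_text('d2016.bin'), assuming the file sits in the current working directory.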

Since we will be looking for some pattern (i.e. words), regular expressions come into play. We will thus be making use of Python's re module.

At this point we have already read the bin file and loaded its content in a string variable. Finding the Zipf's distribution means finding the frequency of occurrence of words in the bin file. The regular expression will thus be used to locate the words in the file.

The method we will be using to make such a match is the findall() method. As mentioned in the re module documentation about findall(), the method will:

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

What we want to do is write a regular expression that will locate all the individual words in the text string variable. The regular expression that can perform this task is:

\b[A-Za-z][a-z]{2,9}\b

where \b is an anchor for word boundaries. In Python, this can be represented as follows:

words = re.findall(r'(\b[A-Za-z][a-z]{2,9}\b)', file_to_string)

This regular expression tells findall() to return every word that starts with a letter (upper-case or lower-case) followed by a sequence of at least 2 and no more than 9 lower-case letters. In other words, the words included in the output will be from 3 to 10 characters long.
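To see which words the pattern keeps and which it skips, here is a small sketch run against a made-up sample sentence (the sentence is an assumption for illustration, not text from d2016.bin). The capturing group from the tutorial's version is omitted; with a single group, findall() returns the same list of strings.

```python
import re

# Same pattern as in the tutorial, without the capturing group.
WORD_PATTERN = re.compile(r'\b[A-Za-z][a-z]{2,9}\b')

# Hypothetical sample text, chosen to exercise the edge cases.
sample = "The heart and an extraordinarily long word, MeSH, of DNA"

matches = WORD_PATTERN.findall(sample)
print(matches)
# → ['The', 'heart', 'and', 'long', 'word']
```

Words shorter than 3 letters (an, of), longer than 10 letters (extraordinarily), and words with upper-case letters after the first character (MeSH, DNA) are all skipped.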

We can now run a loop which aims at calculating the frequency of occurrence of each word:

for word in words:
    count = frequency.get(word,0)
    frequency[word] = count + 1

Here, if the word has not yet been seen, frequency.get() returns the default value 0 instead of raising a KeyError; otherwise it returns the count stored so far. In both cases the count is then incremented by 1 and stored back, so it represents the number of times the word has occurred in the list so far.
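The same counting can also be done with collections.Counter from the standard library. This is an alternative to the loop above rather than the approach the tutorial uses, and the short word list is made up for illustration.

```python
from collections import Counter

# Hypothetical list standing in for the output of re.findall().
words = ['the', 'and', 'the', 'with', 'the', 'and']

# Counter builds the word -> count mapping in a single call.
frequency = Counter(words)

print(frequency['the'], frequency['and'], frequency['with'])
# → 3 2 1
```

A Counter behaves like a dictionary, so the sorting step that follows works on it unchanged.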

Finally, we will print the key-value pair of the dictionary, showing the word (key) and the number of times it appeared in the list (value):

for key, value in reversed(sorted(frequency.items(), key = itemgetter(1))):
    print(key, value)

The expression sorted(frequency.items(), key = itemgetter(1)) sorts the items by value in ascending order, that is, from the least frequent word to the most frequent word. In order to list the most frequent words first, we wrap the result in the reversed() function.
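As a side note, sorted() also accepts a reverse argument, so the descending order can be produced in one call. The sketch below uses a small made-up dictionary in place of the real counts.

```python
from operator import itemgetter

# Hypothetical counts standing in for the real frequency dictionary.
frequency = {'and': 2, 'the': 3, 'with': 1}

# reverse=True sorts from the highest count to the lowest.
ranked = sorted(frequency.items(), key=itemgetter(1), reverse=True)

for key, value in ranked:
    print(key, value)
```

One difference to be aware of: for words with equal counts, reverse=True keeps them in their original relative order, while reversed(sorted(...)) flips that order.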

Putting It All Together

After going through the different building blocks of the program, let's see how it all looks together:

import re
from operator import itemgetter    

frequency = {}
open_file = open('d2016.bin', 'r')
file_to_string = open_file.read()
words = re.findall(r'(\b[A-Za-z][a-z]{2,9}\b)', file_to_string)

for word in words:
    count = frequency.get(word,0)
    frequency[word] = count + 1
    
for key, value in reversed(sorted(frequency.items(), key = itemgetter(1))):
    print(key, value)

I will show here the first ten words and their frequencies returned by the program:

the 42602
abcdef 31913
and 30699
abbcdef 27016
was 17430
see 16189
with 14380
under 13127
for 9767
abcdefv 8694

From this Zipf distribution, we can see Zipf's law at work: a few high-frequency words, such as the, and, was, and for, account for the bulk of word occurrences. The same applies to the sequences abcdef, abbcdef, and abcdefv, which are highly frequent letter sequences that have some meaning particular to this file.
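To compare the output with what a strict 1/rank rule would give, the sketch below takes the first five counts listed above and prints each next to its Zipf prediction (the top count divided by the rank). The helper zipf_prediction is a hypothetical name introduced for this illustration.

```python
# First five (word, count) pairs from the program output shown above.
top_words = [
    ('the', 42602),
    ('abcdef', 31913),
    ('and', 30699),
    ('abbcdef', 27016),
    ('was', 17430),
]

def zipf_prediction(top_count, rank):
    """Count a strict 1/rank rule would predict for the given rank."""
    return top_count / float(rank)

top_count = top_words[0][1]
for rank, (word, count) in enumerate(top_words, start=1):
    # Print rank, word, observed count and predicted count side by side.
    print(rank, word, count, zipf_prediction(top_count, rank))
```

For this file the observed counts fall with rank but more slowly than the strict prediction, so the law holds here only in its loose, approximate sense.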

Conclusion

In this tutorial, we have seen how Python makes it easy to work with statistical concepts such as Zipf's law. Python comes in particularly handy when working with large text files, which would require a lot of time and effort if we were to find the Zipf distribution manually. As we saw, we were able to quickly load, parse, and find the Zipf distribution of a 28 MB file, and sorting the output was simple thanks to Python's dictionaries and the built-in sorted() function.

by Abder-Rahman Ali via Envato Tuts+ Code

AtoZ CSS Screencast: The CSS Line-Height Property

This screencast is a part of our AtoZ CSS Series. You can find other entries to the series here.

Transcript

The CSS line-height property acts in a similar way to leading in print design.

It allows us to control the spacing between lines in paragraphs, headings and other text elements.

Line-height can also be used as a base to create consistent vertical rhythm and spacing throughout an entire project.

In this episode, we’ll look at:

the difference between line-height and leading,
using line-height for vertical alignment, and
using the value of line-height to set up site-wide default spacing.

Line Height vs. Leading

Leading is a typesetting term which describes the distance between the baselines in the text. The term comes from the days of hand-typesetting when strips of lead were used to space out the block of type. When talking about leading, the space is always added below the line.

line-height is a CSS property that controls the height of a line where the spacing is equal above and below the text.

If I have a paragraph with 1em or 16px font-size, and a line-height of 24px, there will be 4px of space added above the text and 4px of space added below; the height of the line will be

4 + 16 + 4 = 24px.

This is the major difference between line-height and leading: in CSS, the text is vertically centered within the line and in print, the space is added beneath the line.
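The arithmetic in the example above can be written as a small calculation. The sketch below is in Python purely for illustration (it is not CSS), and the function name half_leading is a hypothetical label for the space added on each side of the text.

```python
def half_leading(font_size_px, line_height_px):
    """Space in pixels added above (and, equally, below) the text."""
    # The extra height is split evenly between the top and the bottom.
    return (line_height_px - font_size_px) / 2.0

space = half_leading(16, 24)
print(space)                 # → 4.0
print(space + 16 + space)    # → 24.0
```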


by Guy Routledge via SitePoint

Building Your Startup With PHP: Choosing and Configuring Production Hosting

Contentstack

Clean launching soon page with a neat intro demo video promoting a content marketing tool called 'Contentstack'. Would have loved to see the diagonal line design (used in the header) reused in footer but overall a clear, spacious layout.

by Rob Hope via One Page Love

26 Tips for Better Facebook Page Engagement

Have you noticed a drop in your Facebook engagement? Wondering how you can better engage with your fans? Making small changes to what and how you post can help your Facebook updates generate clicks, likes, and comments. In this article, you’ll discover 26 tips for boosting Facebook engagement. #1: Pose a Question One of the [...]


by Derek Cromwell via

Northlandscapes

Long scrolling One Pager showcasing the photography of explorer Jan Erik Waider. Interesting 1320px fixed width on bigger screens, would love to see this much wider but definitely a method to save development time. Thanks for such great build notes Jan!

by Rob Hope via One Page Love