Analysing TYPO3 changes via scrapy

While reading a magazine, I came across scrapy, »an open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.« The framework is written in Python and is easy to use. It can persist the extracted information in several output formats such as XML, JSON and CSV. That makes it easy to fetch structured information like the TYPO3 Changelog. I had also had highcharts bookmarked for some years. My curiosity about how many changes each TYPO3 version introduced, broken down by type of change such as breaking change or feature, was the missing idea that brought everything together.

In this post you will learn a minimal setup that combines scrapy and highcharts to visualize information like the above.

The result can be found on my site.

How to start

First of all you need to analyze the structure of the information you want to gather. That’s as easy as opening the website in your browser, opening the JavaScript console and loading jQuery. Then you can start writing the “query” that fetches the necessary information.

Let’s stick to the above example, using Google Chrome. Head over to the TYPO3 7.0 Changelog, where jQuery is already loaded, open the console and execute the following JavaScript:

$('#breaking-changes a.reference.internal')

This returns all 62 breaking changes for version 7.0, as you can verify with:

$('#breaking-changes a.reference.internal').length

This is the “CSS” selector to get the necessary information. The same works for features:

$('#features a.reference.internal').length

This shows how important markup can be. It’s not only for styling, but also for parsing.
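To make explicit what such a selector matches, here is a minimal sketch in plain Python. It is not scrapy code and not a general CSS engine: it only counts `a` elements carrying both the `reference` and `internal` classes inside the element with a given `id`. The class `SectionLinkCounter`, the function `count_links` and the `SAMPLE` markup are made up for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SectionLinkCounter(HTMLParser):
    """Count <a class="reference internal"> elements inside one section id."""

    def __init__(self, section_id):
        super().__init__()
        self.section_id = section_id
        self.depth = 0   # nesting depth inside the section, 0 means outside
        self.count = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            if tag not in VOID_TAGS:
                self.depth += 1
            classes = (attrs.get("class") or "").split()
            if tag == "a" and {"reference", "internal"} <= set(classes):
                self.count += 1
        elif attrs.get("id") == self.section_id:
            # Entering the section we are interested in.
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


def count_links(html, section_id):
    parser = SectionLinkCounter(section_id)
    parser.feed(html)
    return parser.count


# Made-up markup that mimics the structure of a changelog page.
SAMPLE = """
<div id="breaking-changes">
  <ul>
    <li><a class="reference internal" href="a.html">Breaking: A</a></li>
    <li><a class="reference internal" href="b.html">Breaking: B</a></li>
    <li><a class="reference external" href="https://example.org">Other</a></li>
  </ul>
</div>
<div id="features">
  <a class="reference internal" href="c.html">Feature: C</a>
</div>
"""

print(count_links(SAMPLE, "breaking-changes"))
print(count_links(SAMPLE, "features"))
```

The external link is skipped because it lacks the `internal` class, which is exactly why the class names in the markup matter for parsing.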

Once we know how to get the necessary information, we can use scrapy to fetch and store it for each TYPO3 version.

Fetch information

The whole python script using scrapy to fetch all information is:

import scrapy


class TYPO3VersionSpider(scrapy.Spider):
    name = 'typo3versionspider'
    start_urls = [
        'https://docs.typo3.org/typo3cms/extensions/core/',
        'https://docs.typo3.org/typo3cms/extensions/core/latest/',
    ]

    def parse(self, response):
        for href in response.css('#old-changes a.reference.internal::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_version)

    def parse_version(self, response):
        yield {
            'version': response.css('h1::text').extract()[:3],
            'changes': {
                'breaking': len(response.css('#breaking-changes a.reference.internal::text').extract()),
                'deprecation': len(response.css('#deprecation a.reference.internal::text').extract()),
                'feature': len(response.css('#features a.reference.internal::text').extract()),
                'important': len(response.css('#important a.reference.internal::text').extract()),
            }
        }

We will call this spider via CLI:

scrapy runspider typo3Docs.py -o build/typo3Docs.json

This will run the spider and persist the gathered information in build/typo3Docs.json. Scrapy detects that we want to store JSON because of the file extension.
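The resulting file is a JSON array with one item per version, in the shape yielded by parse_version. The following sketch reads such data with Python's json module and sums the changes per version. The version titles and all counts except the 62 breaking changes for 7.0 are placeholders, and `summarize` is a hypothetical helper, not part of the spider.

```python
import json

# Hypothetical excerpt of build/typo3Docs.json. The titles and most counts
# are placeholders; only the item structure follows the spider above.
SAMPLE_OUTPUT = """
[
  {"version": ["7.1 Changes"],
   "changes": {"breaking": 3, "deprecation": 2, "feature": 5, "important": 1}},
  {"version": ["7.0 Changes"],
   "changes": {"breaking": 62, "deprecation": 0, "feature": 6, "important": 0}}
]
"""


def summarize(items):
    """Map each version title to its total number of documented changes."""
    return {item["version"][0]: sum(item["changes"].values()) for item in items}


items = json.loads(SAMPLE_OUTPUT)
for title, total in summarize(items).items():
    print(title, total)
```

Note that `version` is a list, because the spider slices the list of extracted text nodes. That is why the JavaScript further down reads `version[0]`.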

Basically we run the spider against the start pages, fetch all URLs for the specific versions of branches 7 and 8, and parse the target URLs. That’s done inside parse for start_urls. Once we are on the “detail” views with all changes, we fetch the necessary information in parse_version. Scrapy formats the information and writes it to the specified output file.
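The call to response.urljoin in parse turns the href of each link into an absolute URL relative to the page it was found on. A rough stand-in using only the standard library shows the effect. The relative hrefs below are hypothetical examples, not the actual links on docs.typo3.org.

```python
from urllib.parse import urljoin

# One of the start pages from the spider.
START_URL = "https://docs.typo3.org/typo3cms/extensions/core/latest/"

# Hypothetical hrefs, chosen to show the three common cases:
# relative to the current directory, relative to the parent, and absolute path.
hrefs = [
    "Changelog/7.0/Index.html",
    "../7.6/Index.html",
    "/typo3cms/extensions/core/8.7/",
]

resolved = [urljoin(START_URL, href) for href in hrefs]
for url in resolved:
    print(url)
```

Each resolved URL is what ends up in scrapy.Request, so the spider does not need to care how the links are written in the markup.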

You can see the initial jQuery selectors again on lines 20 to 23 of the script, in the changes dictionary of parse_version, where they fetch the different types of changes.

Display gathered information via Highcharts

Once all information is stored in a local file, we can display it in a chart. That’s done with the following JavaScript and HTML.

First of all we include all necessary scripts:

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script src="https://code.highcharts.com/highcharts.js"></script>
<script src="https://code.highcharts.com/modules/exporting.js"></script>

Then we will load the persisted information:

$(function () {
    $.getJSON('./typo3Docs.json', function (json) {
    });
});

Now we can format the information the way highcharts needs it, inside the callback:

json.sort(function(a, b) {
    var version1 = parseFloat(a.version[0].substr(0, 3)),
        version2 = parseFloat(b.version[0].substr(0, 3));
    if (version1 > version2) {
        return 1;
    }
    if (version1 < version2) {
        return -1;
    }
    return 0;
});

var xAxisCategories = [],
    types = {},
    series = [];

$.each(json, function() {
    var version = this.version[0];
    xAxisCategories.push(version);
    $.each(this.changes, function(typeOfChange) {
        // Collect all existing types.
        if (typeof types[typeOfChange] === 'undefined') {
            types[typeOfChange] = [];
        }
    });
});

$.each(json, function() {
    var changesInVersion = this.changes;
    $.each(types, function(type) {
        if (typeof changesInVersion[type] === 'undefined') {
            types[type].push(0);
        } else {
            types[type].push(changesInVersion[type]);
        }
    });
});

$.each(types, function(type) {
    series.push({
        name: type,
        data: this
    });
});

And configure highcharts itself:

$('#container').highcharts({
    chart: {
        type: 'area'
    },
    title: {
        text: $('title').text()
    },
    subtitle: {
        text: 'Source: docs.typo3.org'
    },
    xAxis: {
        categories: xAxisCategories,
        tickmarkPlacement: 'on',
        title: {
            enabled: false
        }
    },
    yAxis: {
        title: {
            text: 'Changes'
        }
    },
    plotOptions: {
        area: {
            stacking: 'normal',
            lineColor: '#666666',
            lineWidth: 1,
            marker: {
                lineWidth: 1,
                lineColor: '#666666'
            }
        }
    },
    series: series
});

In the end we have the following result:

Figure: The result, our chart with the gathered information

Conclusion

If data is structured, it’s easy to fetch it via scrapy. Once the information is stored in a JSON file, it’s pretty easy to display it via something like highcharts. Although I’ve done most of the logic in JavaScript, you can of course provide the necessary structure via Python instead.

And please note: this post is just about getting started with both tools and making the documented changes in TYPO3 visible, not about nice code.
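As a rough idea of what the Python variant could look like, the sketch below builds the x-axis categories and the series from the scraped items. It is not the code used for the chart. The names `version_key` and `build_chart_data` are made up, the sample counts are placeholders, and the sort key assumes that every title starts with a version like "7.0".

```python
def version_key(item):
    """Sort key: the leading "major.minor" of the title as a tuple of ints.

    Assumes the title starts with something like "7.0".
    """
    major, minor = item["version"][0].split()[0].split(".")[:2]
    return int(major), int(minor)


def build_chart_data(items):
    """Return (categories, series) in the structure the chart config expects."""
    items = sorted(items, key=version_key)
    categories = [item["version"][0] for item in items]

    # Collect every change type that occurs in any version, in first-seen order.
    types = []
    for item in items:
        for change_type in item["changes"]:
            if change_type not in types:
                types.append(change_type)

    # One series per type; versions without that type get a 0.
    series = [
        {"name": t, "data": [item["changes"].get(t, 0) for item in items]}
        for t in types
    ]
    return categories, series


# Placeholder items in the shape produced by the spider.
sample_items = [
    {"version": ["7.1 Changes"], "changes": {"breaking": 3, "feature": 5}},
    {"version": ["7.0 Changes"], "changes": {"breaking": 4, "deprecation": 1}},
]

categories, series = build_chart_data(sample_items)
print(categories)
print(series)
```

Comparing versions as integer tuples instead of floats keeps a version such as 10.0 after 9.5, which a comparison based on the first three characters of the title would not.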

Further reading

Check out the used tools:

  • scrapy: The Python framework to gather information from websites / URLs.
  • highcharts: The JavaScript library to display charts.
  • Source code: The source code is available on GitHub.
  • The script is run every night and results are available on my site.