Multi-label classification with classifier chains

21 May 2019

In this blog post I will explain multi-label classification, show how to collect metadata for articles in the arXiv database, and show how to train a classifier chain on such data. After covering each of these topics, we will put them together into a project!

Multi-label classification

When starting out with machine learning, one is typically first exposed to binary classification and eventually to multi-class classification problems. An example of the former is determining whether there is a cat in a given picture, while classifying someone’s age (group) from their picture is an example of the latter. What these two kinds of problems have in common is that each sample is associated with a single label. In the first example the labels are yes and no, and it is not possible for a cat (not belonging to Schrödinger) to be both present and not present in a single picture. In the second example the labels could for instance be child, teenager, middle-aged and pensioner, and it is (pretty much by definition) not possible for any two of these to co-occur.

As you’ve probably guessed by now, multi-label classification problems are those classification problems where each sample can be assigned more than one label. Such problems occur naturally in the real world. Indeed, a movie can be characterised as both an action movie and a comedy, and a patient can (unfortunately) suffer from more than one disease at a given time.

Notation:

For $n$ different labels $l_1, l_2, \ldots, l_n$ and a sample $s$, I use an $n$-tuple $(a_1, a_2, \ldots, a_n)$ to denote which labels $s$ is associated with, where $a_i$ is $1$ if $s$ is associated with label $l_i$ and $0$ otherwise.
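As a small illustration of this notation, the sketch below encodes a set of labels as such a tuple. The genre names and the helper `encode` are made up for the example and are not used elsewhere in this post.

```python
# Hypothetical label set l_1, ..., l_n (here n = 4); the names are illustrative only.
LABELS = ("action", "comedy", "documentary", "science fiction")


def encode(sample_labels, labels=LABELS):
    """Return the n-tuple (a_1, ..., a_n) with a_i = 1 if label l_i is present, else 0."""
    present = set(sample_labels)
    return tuple(1 if label in present else 0 for label in labels)


# A movie that is both an action movie and a comedy.
print(encode({"action", "comedy"}))  # (1, 1, 0, 0)
```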

I will now explain two simple methods for training a multi-label classifier, essentially following the paper “Classifier Chains for Multi-label Classification” by Jesse Read, Bernhard Pfahringer, Geoff Holmes and Eibe Frank.

Binary relevance (BM)

For a multi-label problem with $n$ labels, say $l_1, l_2, \ldots, l_n$, binary relevance (BM) transforms the problem into $n$ binary classification problems (one per label). Hence we train $n$ binary classifiers $C_1, C_2, \ldots, C_n$, where $C_i$ predicts whether a sample $s$ has label $l_i$, and the BM classifier then predicts the union of the individual predictions to be the labels associated with $s$. We can make this clearer using the notation introduced above: for a sample $s$ we let $p_i(s)$ be $1$ if the label $l_i$ was predicted by the corresponding classifier and $0$ otherwise. The BM classifier will then predict that $s$ is associated with the multi-label $(p_1(s), p_2(s), \ldots, p_n(s))$.
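To make the construction concrete, here is a minimal sketch of binary relevance in plain Python. It assumes a deliberately simple stand-in for the binary classifiers (a 1-nearest-neighbour rule, `train_1nn`) and a toy data set; neither is taken from the paper.

```python
def train_1nn(X, y):
    """Stand-in binary classifier: copy the label of the closest training sample."""
    def predict(x):
        distances = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in X]
        return y[distances.index(min(distances))]
    return predict


def train_binary_relevance(X, Y):
    """Train one independent binary classifier C_i per label l_i."""
    n = len(Y[0])
    return [train_1nn(X, [row[i] for row in Y]) for i in range(n)]


def predict_binary_relevance(classifiers, s):
    """Return the multi-label (p_1(s), ..., p_n(s))."""
    return tuple(C(s) for C in classifiers)


# Toy data: two features, two labels.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
Y = [(0, 0), (0, 1), (1, 0), (1, 1)]
classifiers = train_binary_relevance(X, Y)
print(predict_binary_relevance(classifiers, (0.9, 0.2)))  # (1, 0)
```

Note that each classifier only ever sees its own column of `Y`, which is exactly why label dependencies are invisible to this method.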

The BM method has the advantage that it is very easy to implement, but it has the drawback that it does not take label dependence into account during training. For example, suppose we are training a classifier to classify movie genres based on a movie’s title, and let $P(\text{documentary and science fiction})$ denote the probability that a movie drawn at random (from the training set) is both a documentary and science fiction. If this probability is not the same as the product $P(\text{documentary}) \cdot P(\text{science fiction})$, then there are dependencies among the labels. It is argued in the literature that it is crucial to take such dependency information into account when training a multi-label classifier; otherwise the classifier will typically predict too many labels, too few labels, or even combinations that don’t co-occur in reality.
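A quick way to see such a dependency is to compare the two quantities empirically. The sketch below does this on a made-up label matrix (the numbers are illustrative and not real movie data); the helper `label_probability` is a name chosen for this example.

```python
def label_probability(Y, indices):
    """Empirical probability that all labels at the given indices are present."""
    hits = sum(1 for row in Y if all(row[i] == 1 for i in indices))
    return hits / len(Y)


# Columns: (documentary, science fiction). In this toy set the two never co-occur.
Y = [(1, 0), (1, 0), (0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (0, 1)]

p_doc = label_probability(Y, [0])      # 3 of 8 samples
p_scifi = label_probability(Y, [1])    # 3 of 8 samples
p_both = label_probability(Y, [0, 1])  # 0 of 8 samples

# If the labels were independent these two numbers would agree.
print(p_both, p_doc * p_scifi)  # 0.0 0.140625
```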

Classifier chains (CC’s)

In their paper, Jesse Read et al. offer a simple way to modify and improve binary relevance so that dependencies among the labels are taken into consideration during training. A classifier constructed by their method is called a classifier chain (abbreviated CC), and the reason for this name will become apparent once we have explained the method:

Given a multi-label classification task with $n$ labels $l_1,…, l_n$ and a set $X$ of training samples (and a set $Y$ of the corresponding multi-labels), we first train a single binary classifier $C_1$ to predict whether a given sample has the label $l_1$. As before we let $p_1(s)$ be $1$ if a sample $s$ was predicted to be associated with $l_1$ and $0$ otherwise. Now for every sample $s$ in $X$ we create a new sample $(s,p_1(s))$ and let $F_1$ be the set of all such samples. We now train a binary classifier $C_2$ on the samples in $F_1$ (to predict if they are associated with $l_2$) and similarly we produce the set of samples $F_2$ (which will be of the form $(s,p_1(s), p_2(s))$). We continue in this way until we have trained $C_n$ on $F_{n-1}$. In this way we get a ‘chain’ of classifiers $C_1, C_2, …, C_n$ which we call the CC and for a sample $s$ the CC will predict that $(p_1(s), p_2(s), …, p_n(s))$ is the associated multi-label.
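The procedure above can be sketched in a few lines of plain Python. As assumptions for illustration only, the binary classifiers are 1-nearest-neighbour rules (`train_1nn`) and the data is a toy set in which the second label always agrees with the first. The sketch follows the description in this post, i.e. the predictions $p_i(s)$ are appended as extra features.

```python
def train_1nn(X, y):
    """Stand-in binary classifier: copy the label of the closest training sample."""
    def predict(x):
        distances = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in X]
        return y[distances.index(min(distances))]
    return predict


def train_chain(X, Y):
    """Train C_1, ..., C_n, where C_i sees the sample plus p_1(s), ..., p_{i-1}(s)."""
    n = len(Y[0])
    F = [tuple(s) for s in X]  # the plain samples, used to train C_1
    chain = []
    for i in range(n):
        C_i = train_1nn(F, [row[i] for row in Y])
        chain.append(C_i)
        # Build the next data set by appending the prediction of C_i as a new feature.
        F = [f + (C_i(f),) for f in F]
    return chain


def predict_chain(chain, s):
    """Return (p_1(s), ..., p_n(s)), feeding each prediction to the next classifier."""
    f = tuple(s)
    preds = []
    for C_i in chain:
        p = C_i(f)
        preds.append(p)
        f = f + (p,)
    return tuple(preds)


# Toy data: one feature, two labels.
X = [(0.0,), (1.0,), (2.0,), (3.0,)]
Y = [(0, 0), (0, 0), (1, 1), (1, 1)]
chain = train_chain(X, Y)
print(predict_chain(chain, (2.6,)))  # (1, 1)
```

The only difference from binary relevance is the growing feature tuple `f`: the second classifier gets to see what the first one predicted.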

Implementing CC’s using Keras

In this section I will write a Python class that constructs a CC where each binary classifier in the chain is a Keras Sequential model. In a later blog post I may do this using low-level TensorFlow, but let’s keep things simple for now.

Before presenting the code, be warned that I implement classifier chains in a natural and convenient, but (alas) very memory-inefficient way, since I don’t plan to discuss garbage collection in TensorFlow and Python in this post. After presenting the code I will say a few words on how to modify the class so that multi-label problems with more than a few labels can be handled on a relatively modest computer.

Here is the code (that does not have to be understood on a first reading):

import tensorflow as tf
import tensorflow.keras as keras
import numpy as np
from scipy.sparse import csr_matrix, hstack, vstack


class ClassifierChain(object):
    """
    Class for implementation of classifier chains using keras.Sequential models.

    Upon construction a list of binary classifiers (self.classifiers) is created. The fit method trains each of the
    classifiers to predict if the sample has the corresponding label (given the predictions of the previous
    classifiers in the chain).

    Methods:
        __init__: Constructor method.

        append_classifier: Add a classifier

        replace_classifier: Replace a classifier

        project_to_binary: Static method used to transform a multi-label to a binary label.

        compute_sample_weights: Static method returning class weights in terms of the samples.

        compute_sample_weights_chain: Static method enabling class-weights to be updated during training of the chain.

        create_batch_generator: Static method returning a python generator of training batches.

        fit_single: Trains a single classifier in the chain.

        predict_single: Static method for binary classification of samples in a sparse matrix.

        generate_next_data_set: Static method generating the data set for the next classifier in the chain.

        fit: Trains the classifier chain.

        predict: Predicts the multi-labels of the samples.

    """

    def __init__(self, classifier, num_labels, name, optimizers=None, losses=None, create_missing=False):
        """
        Constructor method. Creates a chain of classifiers to be used on multi-label classification.

        :param classifier: The first classifier to appear in the CC (classifier chain).

        :param num_labels: The number of labels in the multi-label classification task.

        :param name: The name of the classifier chain. This is used during saving.

        :param optimizers: optional, default = None. List of length num_labels -1 where each entry is a keras optimizer.

        :param losses: optional, default = None. List of length num_labels - 1 where each entry is a loss function.

        :param create_missing: optional, default = False. If set to True classifiers are automatically created
        with configurations inferred from the input classifier. These classifiers are then
        compiled with the optimizer and loss function at the corresponding index in optimizers and losses respectively.

        :type classifier: keras.Sequential model

        :type num_labels: int

        :type name: str

        :type optimizers: list

        :type losses: list

        :type create_missing: bool
        """
        self.classifier = classifier
        self.num_labels = num_labels
        self.classifiers = []
        self.classifiers.append(self.classifier)
        self.classifier_config = self.classifier.get_config()
        self.classifier_input_shape = self.classifier_config['layers'][0]['config']['batch_input_shape']

        self.name = name
        if create_missing:
            self.optimizers = optimizers
            self.losses = losses
            for i in range(1, self.num_labels):
                cfg = self.classifier_config.copy()
                cfg['layers'][0]['config']['batch_input_shape'] = \
                    (self.classifier_input_shape[0], self.classifier_input_shape[1] + i)
                self.classifiers.append(keras.models.Sequential.from_config(cfg))
                self.classifiers[i].compile(optimizer=self.optimizers[i - 1], loss=self.losses[i - 1])

    def append_classifier(self, filename):
        """
        Loads and appends a (possibly previously trained) classifier to self.classifiers.
	
        If len(self.classifiers) = self.num_labels the classifier is not added.

        :param filename: Name of the file that holds the classifier.

        :return: None

        :type filename: str
        """
        if len(self.classifiers) == self.num_labels:
            print('There are already as many classifiers as labels in the chain')
        else:
            self.classifiers.append(keras.models.load_model(filename))

    def replace_classifier(self, i, classifier):
        """
        Replaces the i'th classifier in our chain (we count from 0) with the given input classifier.

        :param i: The index of the classifier to be replaced.

        :param classifier: The replacement classifier.

        :return: None

        :type i: int

        :type classifier: keras Sequential model
        """
        self.classifiers[i] = classifier

    @staticmethod
    def project_to_binary(y, i):
        """
        Converts a multi-label to a binary label.

        If y is a one-dimensional array consisting of zeros and ones, then this method returns the array ([1,0])
        if the i'th entry equals 1 and the array ([0,1]) otherwise. For two-dimensional input this is
        carried out row-wise.

        :param y: Multi-labels.

        :param i: Index to consider.

        :return: Binary label.

        :type y: ndarray, shape = (num_samples, self.num_labels),

        :type i: int, 0 <= i <= self.num_labels - 1

        :rtype: ndarray, shape = (num_samples, 2)
        """
        ones = np.ones(y.shape[0])
        zeros = np.zeros(y.shape[0])
        M_1 = np.column_stack((ones.T, zeros))
        M_2 = np.column_stack((zeros.T, ones))
        y_i = y[:, i]
        d_i = ones - y_i
        y_ret = (M_1.T * y_i).T + (M_2.T * d_i).T
        return y_ret

    @staticmethod
    def compute_sample_weights(y, debalancing=0):
        """
        Returns class weights in terms of sample weights.

        :param y: Multi-labels.

        :param debalancing: optional, default = 0. Term to add more importance to frequently occurring labels.

        :return: Sample weights.

        :type y: ndarray, shape (num_samples, num_labels)

        :type debalancing: float

        :rtype: ndarray, shape = (num_samples,)
        """
        unique_rows, inverse, counts = np.unique(y, axis=0, return_inverse=True, return_counts=True)
        weights = np.array([float(y.shape[0]) / float(counts[m]) + debalancing for m in inverse])
        weights = weights / np.amax(weights)
        return weights


    @staticmethod
    def compute_sample_weights_chain(X, y, preds_start_index, debalancing=0):
        """
        This method allows our classifier chain to automatically infer class-weights during training (in chain mode).

        The method is roughly implemented
        as follows: The i'th classifier in the chain receives a sample together with the predictions of the i-1 former
        classifiers. For each sample we extract the tuple of predictions and append the corresponding label, i.e.
        ([1,0]) or ([0,1]). We then count the number of times each unique tuple occurs and weight the corresponding
        samples by the rule total_num_samples/occurrences * (scaling factor). The debalancing parameter allows
        for giving some additional weight to (all samples) so that more frequently occurring labels get some more
        importance during training.

        :param X: Sparse matrix where the columns up to preds_start_index are features and remaining columns
        indicate which label was predicted by each (previous) binary classifier in the chain.

        :param y: The sample labels.

        :param preds_start_index: Indicates the index of the first column in X where predictions have been made.

        :param debalancing: Adds importance to more frequent labels (during training).

        :return: Sample weights

        :type X:csr-matrix, shape= (num_samples, num_features + num_predicted_labels)

        :type y: ndarray, shape = (num_samples, 2)

        :type preds_start_index: int

        :type debalancing: float

        :rtype: ndarray, shape = (num_samples,)
        """
        y = csr_matrix(y)
        data = hstack([X, y]).tocsr()
        preds_and_label = data[:, preds_start_index:].toarray()
        weights = ClassifierChain.compute_sample_weights(preds_and_label, debalancing=debalancing)
        return weights

    @staticmethod
    def create_batch_generator(
            X, y_input, batch_size=32, shuffle=False, weights_mode=None, predefined_weights=None,
            preds_start_index=None, debalancing=0):
        """
        Creates a python generator of training batches.

        :param X: Sparse matrix of samples (and possibly labels predicted previously in the chain).

        :param batch_size: optional,default = 32. Number of samples in each batch to be yielded from the created
        generator.

        :param shuffle: optional, default = False. Whether to shuffle the samples before creating the generator.

        :param weights_mode: optional, default = None. If this input is 'chain' then samples are automatically
         weighted by the method compute_sample_weights_chain.

        :param predefined_weights: optional, default = None. Weights for the samples.
        This argument is only considered if weights_mode is None.

        :param preds_start_index: optional, default = None. Only relevant if weights_mode is set to 'chain'
        in which case this parameter refers to the index of the first column corresponding to previous predictions.

        :param debalancing: optional, default = 0. Only relevant if weights_mode is chain in which case this parameter
        adds importance to frequently occurring labels.

        :return: A batch generator

        :type X: csr-matrix, shape = (num_samples, num_features + (possibly previously predicted labels))

        :type y_input: ndarray, shape = (num_samples, 2)

        :type shuffle: bool

        :type weights_mode: str

        :type predefined_weights: ndarray, shape = (num_samples,)

        :type preds_start_index: int

        :type debalancing: float

        :rtype: generator

        """
        X_copy = X.copy()
        y_copy = csr_matrix(np.copy(y_input))
        if weights_mode == 'chain':
            if preds_start_index is None:
                preds_start_index = X_copy.shape[1]

            weights = ClassifierChain.compute_sample_weights_chain(
                X_copy, y_copy, preds_start_index=preds_start_index, debalancing=debalancing)
            weights = csr_matrix(weights)

        elif predefined_weights is not None:
            weights = csr_matrix(predefined_weights)
        else:
            weights = csr_matrix(np.ones(X.shape[0]))
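        if shuffle:
            data = hstack([X_copy, y_copy, weights.transpose()])
            data = data.tocsr()
            row_indices = np.arange(X_copy.shape[0])
            np.random.shuffle(row_indices)
            data = data[row_indices]
            X_copy = data[:, :-3]
            y_copy = data[:, -3:-1]
            weights = data[:, -1].toarray().flatten()
        else:
            # Without shuffling, weights is still a sparse (1, num_samples) row vector. Flatten it so that
            # the slicing below selects samples rather than rows.
            weights = weights.toarray().flatten()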

        for i in range(0, X.shape[0], batch_size):
            X_batch = X_copy[i: i + batch_size, :].toarray()
            y_batch = y_copy[i: i + batch_size, :].toarray()
            weights_batch = weights[i: i + batch_size]

            if weights_mode == 'chain':

                yield (X_batch, y_batch, weights_batch)

            elif predefined_weights is not None:
                yield (X_batch, y_batch, weights_batch)

            else:
                yield (X_batch, y_batch)

    def fit_single(
            self, i, X, y, epochs=1, batch_size=32, verbose=1, shuffle=True, weights_mode=None,
            predefined_weights=None, preds_start_index=None, debalancing=0, save_after=None):
        """
        Train a single classifier in the chain.

        :param i: Number of the classifier in the chain (we count from 0)

        :param X: Samples (with possibly previous predictions appended as columns)

        :param y: class labels

        :param epochs: optional, default=1. Number of training epochs.

        :param batch_size: optional, default = 32. Number of samples per gradient update.

        :param verbose: optional, default = 1. This argument is passed to classifier.fit_generator

        :param shuffle: optional, default True. Whether to shuffle the training set at the start of each epoch.

        :param weights_mode: optional, default None. If set to 'chain' samples will automatically be weighted by the
        method self.compute_sample_weights_chain.

        :param predefined_weights: optional, default = None. Weights for the samples. These weights will
        only be considered if weights_mode is None.

        :param preds_start_index: optional, default = None. Only relevant if weights_mode is chain in which case it
        indicates the index of the first column of X corresponding to previous predictions.

        :param debalancing: optional, default 0. Only relevant if weights_mode is chain in which case
        the parameter adds importance to frequently occurring labels.

        :param save_after: optional, default None. If set to 'epoch', classifier i will be saved in its entirety
        after every epoch. If set to 'completion' we save the classifier after the method is complete.


        :type i: int

        :type X: csr_matrix, shape= (num_samples, num_features + (possibly previous predictions))

        :type y: ndarray, shape = (num_samples, num_labels)

        :type epochs: int

        :type batch_size: int

        :type verbose: int

        :type shuffle: bool

        :type weights_mode: str

        :type predefined_weights: ndarray, shape = (num_samples,)

        :type preds_start_index: int

        :type debalancing: float

        :type save_after: str


        :return: None
        """

        steps_per_epoch = int(np.ceil(X.shape[0] / batch_size))
        y_input = self.project_to_binary(y, i)
        # todo: rewrite the batch generator so it restarts when all batches have been yielded, in that way we don't
        #  need to loop over the epochs, and can use the epochs parameter in keras.Sequential.fit_generator instead.
        for epoch in range(epochs):
            print('Training classifier %d: Epoch %d/%d' % (i, epoch + 1, epochs))
            batch_generator = self.create_batch_generator(
                X, y_input, batch_size=batch_size, shuffle=shuffle, weights_mode=weights_mode,
                predefined_weights=predefined_weights, preds_start_index=preds_start_index,
                debalancing=debalancing)

            self.classifiers[i].fit_generator(
                generator=batch_generator, steps_per_epoch=steps_per_epoch, verbose=verbose)
            if save_after == 'epoch':
                self.classifiers[i].save('%s_classifier_%s_epoch_%s.h5' % (self.name, str(i), str(epoch)))
        
        if save_after == 'completion':
            if epochs > 0:
                self.classifiers[i].save('%s_classifier_%s_epochs_%s.h5' % (self.name, str(i), str(epochs)))

    @staticmethod
    def predict_single(X, classifier, batch_size=32):
        """
        The input classifier makes predictions based on the input csr-matrix X.

        :param X: sparse matrix of features (and possibly previously predicted labels)

        :param classifier: The classifier that is to perform the predictions.

        :param batch_size: optional, default = 32. How many samples to predict at a time. All predictions will be
        accumulated in an ndarray.

        :return: one dimensional ndarray of predictions (consisting of zeros and ones)

        :type X: csr-matrix, shape = (num_samples, num_features + (possibly previously predicted labels))

        :type classifier: keras Sequential model.

        :type batch_size: int

        :rtype: ndarray, shape = (num_samples,)
        """
        y = np.zeros(X.shape[0])
        for i in range(0, X.shape[0], batch_size):
            X_batch = X[i: i + batch_size, :].toarray()
            y[i: i + batch_size] = (np.argmax(classifier.predict(X_batch, batch_size=batch_size), axis=1) == 0).astype(int)

        return y

    @staticmethod
    def generate_next_data_set(X, classifier, batch_size=32):
        """
        Creates the data set for the next classifier in the chain.

        :param X: sparse matrix of samples (and possibly previously predicted labels).

        :param classifier: The classifier used to predict from X thus generating the next data set.

        :param batch_size: optional, default = 32. The number of samples to predict at a time.

        :type X: csr-matrix, shape=(num_samples, features + (possibly previously predicted labels))

        :type classifier: keras Sequential model

        :type batch_size: int

        :rtype: csr-matrix, shape = (X.shape[0], X.shape[1] + 1)
        """
        preds = csr_matrix(ClassifierChain.predict_single(X=X, classifier=classifier, batch_size=batch_size))
        X_next = hstack([X, preds.transpose()]).tocsr()
        return X_next

    def fit(
            self, X, y, epochs=1, batch_size=32, verbose=1,
            weights_mode=None, predefined_weights=None, debalancing=0, shuffle=True, save_after='classifier'):
        """
        Trains all classifiers in the chain

        :param X: Sparse matrix of samples .

        :param y: The sample labels.

        :param epochs: optional, default = 1. Either a list of integers where entry i corresponds to the number of
        epochs the i'th classifier is to be trained on, or an integer n in which case all classifiers
        will be trained on n epochs.

        :param batch_size: optional, default = 32. Number of samples per gradient update (for each of the classifiers).

        :param verbose: optional, default = 1. This parameter is eventually passed to each classifier's fit_generator
        method.

        :param weights_mode: optional, default = None. If set to 'chain' samples will automatically be weighted
        by the method self.compute_sample_weights_chain

        :param predefined_weights: optional, default = None. List of weights for the samples trained per classifier
        (The entries of the list must be an ndarray of shape = (num_samples,)). If the length of this list is less than
        self.num_labels the remaining classifiers are trained on automatically weighted samples if weights_mode is
        'chain'. Otherwise these samples will not be weighted.

        :param debalancing: optional, default = 0. Either a List of length self.num_labels containing float values or a
        single float value. In the latter case a list of floats equal to this given value will be considered.
        Only relevant if weights_mode is 'chain' in which case the i'th entry of the list corresponds to adding
        importance to frequently occurring samples while training the i'th classifier in the chain.

        :param shuffle: optional, default = True. Whether to shuffle the samples at the beginning of every epoch.

        :param save_after: optional, default = 'classifier'. If set to 'classifier' we make each classifier save after
        it is finished training. If set to 'epoch' it saves each classifier after every epoch the classifier has run.

        :type X: csr_matrix, shape = (num_samples, num_features)

        :type y: ndarray, shape= (num_samples, num_labels)

        :type epochs: int or list

        :type batch_size: int

        :type verbose: int

        :type weights_mode: str

        :type predefined_weights: list

        :type debalancing: float or list

        :type save_after: str
        """
        if len(self.classifiers) < self.num_labels:
            raise IndexError('Method fit can only be run when there are as many classifiers as labels')
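        if isinstance(epochs, int):
            epochs_list = self.num_labels * [epochs]
        else:
            epochs_list = epochs

        # Accept ints as well as floats here, since the docstring allows a single float value.
        if isinstance(debalancing, (int, float)):
            debalancing_list = self.num_labels * [debalancing]
        else:
            debalancing_list = debalancing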

        F = X.copy()
        preds_start_index = X.shape[1]
        for i in range(self.num_labels):
            if save_after == 'classifier':
                save_after = 'completion'
            elif save_after == 'epoch':
                save_after = 'epoch'

            current_mode = weights_mode
            current_weights = None

            if predefined_weights is not None:
                if i < len(predefined_weights):
                    current_weights = predefined_weights[i]
                    current_mode = None
                else:
                    predefined_weights = None

            self.fit_single(
                i=i, X=F, y=y, epochs=epochs_list[i], batch_size=batch_size, shuffle=shuffle,
                verbose=verbose, weights_mode=current_mode, predefined_weights=current_weights,
                preds_start_index=preds_start_index, debalancing=debalancing_list[i], save_after=save_after)

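            if i == self.num_labels - 1:
                break

            # Append the predictions of classifier i as a new column, giving the data set for classifier i + 1.
            F = self.generate_next_data_set(X=F, classifier=self.classifiers[i], batch_size=batch_size)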

As mentioned, the code is not memory efficient. Indeed, if num_labels is large, the class holds many classifiers, all of which may eat up a lot of memory! Hence if you want to train a CC on many labels, I suggest altering the fit method so that after training a classifier, saving it to file and constructing the dataset for the next classifier, the classifier is removed from memory before the next one is trained. Also don’t forget to modify the predict method accordingly!
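One possible shape for such a modification is sketched below. To keep the example runnable without TensorFlow, it assumes a stand-in model (a 1-nearest-neighbour rule stored as JSON) and the hypothetical helpers `fit_low_memory` and `predict_low_memory`. With Keras models one would presumably use the model's save method and keras.models.load_model, both of which already appear in the class above, in place of the JSON calls.

```python
import json
import os
import tempfile


def train_1nn(F, y):
    """Stand-in 'model': a 1-nearest-neighbour classifier is just its training data."""
    return {"F": [list(f) for f in F], "y": list(y)}


def predict_1nn(model, x):
    distances = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in model["F"]]
    return model["y"][distances.index(min(distances))]


def fit_low_memory(X, Y, out_dir):
    """Train the chain while keeping only the current classifier in memory."""
    F = [list(s) for s in X]
    paths = []
    for i in range(len(Y[0])):
        model = train_1nn(F, [row[i] for row in Y])
        path = os.path.join(out_dir, "classifier_%d.json" % i)
        with open(path, "w") as fh:
            json.dump(model, fh)                      # save to file ...
        F = [f + [predict_1nn(model, f)] for f in F]  # ... build the next data set ...
        del model                                     # ... then drop the classifier
        paths.append(path)
    return paths


def predict_low_memory(paths, s):
    """Load one classifier at a time while walking down the chain."""
    f, preds = list(s), []
    for path in paths:
        with open(path) as fh:
            model = json.load(fh)
        p = predict_1nn(model, f)
        preds.append(p)
        f = f + [p]
    return tuple(preds)


X = [(0.0,), (1.0,), (2.0,), (3.0,)]
Y = [(0, 0), (0, 0), (1, 1), (1, 1)]
with tempfile.TemporaryDirectory() as tmp:
    paths = fit_low_memory(X, Y, tmp)
    result = predict_low_memory(paths, (2.6,))
print(result)  # (1, 1)
```

The chain object then only needs to remember the file names, at the cost of one load per classifier during prediction.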

Note also that the class contains some sample weighting methods. The reason is that multi-labeled data sets typically have rather imbalanced (multi-)label presence. Indeed, in many such problems one expects samples associated with very many labels to occur less frequently, and although the CC is designed to “understand” this, we also don’t want it to fail at classifying the minority class. The idea behind the weights_mode = 'chain' argument in the fit method is to apply more class weighting the further out in the chain we go, as the minority class becomes more “visible” at this point. For example, if we are training a CC to classify music genres and both jazz and blues have already been predicted, it is less likely that the sample will also be associated with, say, heavy metal by the third classifier in the chain. However, such music can exist, and the sample weighting is there to ensure that the heavy metal classifier pays some more attention to these (relatively) few samples.
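The weighting rule itself is easy to try out in isolation. The sketch below is a simplified, pure-Python stand-in for compute_sample_weights (the function name `sample_weights` and the data are made up): each row holds the previous predictions followed by a single 0/1 label, whereas the class above uses a two-column label. Every sample gets the weight total/occurrences (plus debalancing), rescaled so that the largest weight is 1.

```python
from collections import Counter


def sample_weights(rows, debalancing=0.0):
    """Weight each row by num_rows / (occurrences of that row), scaled to a maximum of 1."""
    rows = [tuple(r) for r in rows]
    counts = Counter(rows)
    raw = [len(rows) / counts[r] + debalancing for r in rows]
    largest = max(raw)
    return [w / largest for w in raw]


# Each row: (jazz predicted, blues predicted, heavy metal label).
rows = [(1, 1, 0), (1, 1, 0), (1, 1, 0), (1, 1, 1)]
weights = sample_weights(rows)
print(weights)  # the single jazz + blues + heavy metal sample gets weight 1.0, the others about 0.33
```

Increasing `debalancing` adds the same constant to every raw weight before rescaling, which moves the weights of the frequent combinations closer to 1.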

Now that we have our classifier chain implementation we can attempt a multi-label classification problem. To keep things simple we will consider the most basic kind of such problems, that is the case of two labels. Ideally we want a dataset that is easy to acquire, whose samples are associated to at most two labels that can and do co-occur. The first dataset satisfying these requirements that springs to (my) mind is the following: “Titles of papers labelled either algebraic geometry or number theory”.

What is algebraic geometry?

In short, algebraic geometry is (classically) the study of the geometry of polynomial equations. That is, one studies the set of zeros of polynomials in several variables. For example, $y - x^2$ is a polynomial in two variables, and you can convince yourself that the set of solutions to the equation $y - x^2 = 0$ is a parabola! Note that here we started out with something of an algebraic nature (a polynomial or an equation) and ended up with a geometrical object. In modern algebraic geometry this idea is extended so that more general algebraic objects (including the set of integers) can be associated with a geometrical object. In this way one can speak of the set of prime numbers as a curve!

Despite being a research discipline in its own right, algebraic geometry has shed light on many problems in number theory. In fact, one of the most famous problems in number theory, known as Fermat’s last theorem, was first successfully proved in this way.

Fetching our dataset from the arXiv

The arXiv is a database containing thousands of papers from many scientific disciplines. Since arXiv permits its metadata to be harvested through the OAI-PMH interface, it gives us a pretty much ideal method for collecting our desired dataset.
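To give an idea of what such a harvesting request looks like, here is a minimal sketch that only builds the request URL and does not contact arXiv. The base URL and the query parameters are the ones used in the class further down; the helper name `build_list_records_url` and the example dates are assumptions made for this illustration.

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/oai2"


def build_list_records_url(science, start_date=None, end_date=None):
    """Build an OAI-PMH ListRecords URL for one arXiv set, optionally limited to a date range."""
    params = {"verb": "ListRecords", "set": science, "metadataPrefix": "arXiv"}
    if start_date is not None and end_date is not None:
        # Dates are given in (extended) ISO format, e.g. 2019-05-21.
        params["from"] = start_date
        params["until"] = end_date
    return BASE_URL + "?" + urlencode(params)


url = build_list_records_url("math", "2019-05-01", "2019-05-21")
print(url)
```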

The blog post at http://betatim.github.io/posts/analysing-the-arxiv/ provides a function harvest that collects pretty much all metadata about the articles (modified between specified dates) found on the experimental physics arXiv and stores this information in a pandas dataframe.

We will now write a Python class with a similar harvesting method called fetch and a method transform_spread_labels that essentially provides the multi-labels (the categories, i.e. algebraic geometry, number theory, etc., of a (mathematics) paper) in the format/notation we use.

Here is the code:

# We use the requests HTTP library
import requests

# The Selector class from the Parsel library allows for extraction of data from HTML and XML using XPath selectors
from parsel import Selector

# We will use pandas to store our data in a DataFrame
import pandas as pd

# It is sometimes convenient to work with today's date which we can access from the datetime library
from datetime import date

# arXiv uses 503 Retry-After replies to implement flow control; we will use the time library to abide by these requests
import time

import numpy as np


class ArXivFetcher(object):
	"""
	Class for harvesting metadata from arXiv with the help of OAI.

	The class holds a pandas DataFrame named df that can be updated to store metadata collected from the arXiv database.

	Methods:

		__init__: Constructor method

		fetch: Fetches data from arXiv and appends it to our database.

		append_today: Fetches data from today's arXiv papers and appends the data to our database.

		fetch_categories: Updates the categories associated to our science of interest.

		export_data_to_csv: Exports our database to csv.

		export_categories_to_csv: Exports the category tags and associated names to csv.

		load_data_from_csv: Loads previously harvested data.

		transform_spread_labels: Spreads out the content of the categories column to a column for each category.

		return_titles_categories: Returns a pandas dataframe of harvested titles and categories.
	"""

	def __init__(self, science):
		"""
		Constructor method.

		:param science: Name of the science to harvest arXiv metadata from. Examples include: math, cs, nlin, q-bio,
		etc. See the section named Sets at https://arxiv.org/help/oa for further explanation.

		:type science: str
		"""
		self.base_url = 'http://export.arxiv.org/oai2?verb=ListRecords'
		self.science = science
		self.meta_data = 'metadataPrefix=arXiv'
		self.df = pd.DataFrame(columns=('Titles', 'Created', 'Categories'))
		self.categories_in_science = {}
		if self.science == 'math':
			self.fetch_categories()
		self.transformed = False

	def fetch(self, start_date=None, end_date=None):
		"""
		Fetches data from arXiv in a specified time interval and appends it to our database.

		This method is similar to the method harvest found at http://betatim.github.io/posts/analysing-the-arxiv/.

		:param start_date: We collect data from arXiv papers starting from this date which must be supplied
		in (extended) iso format.

		:param end_date: We fetch data from arXiv papers published before this date (given in (extended) iso format).

		:type start_date: str

		:type end_date: str

		"""
		url = self.base_url + '&' + 'set=' + self.science + '&' + self.meta_data

		if start_date is not None and end_date is not None:

			print(
				'Fetching data from articles in %s modified on the arXiv in the period: %s to %s'
				% (self.science, start_date, end_date))

			url += '&from=' + start_date + '&until=' + end_date

		else:
			print('Fetching data from all articles in %s published on arXiv' % self.science)

		page = requests.get(url)
		selector = Selector(text=page.text, type='xml')
		# The data we want to extract are xml elements registered under namespaces that we now specify
		ns = {'arXiv': 'http://arxiv.org/OAI/arXiv/', 'oai': 'http://www.openarchives.org/OAI/2.0/'}

		# The OAI-PMH framework implements flow control by means of not necessarily providing the complete list we are
		# requesting as a response. Whenever this occurs the response will contain a "resumptionToken" that can be
		# given as an argument to our next request in order to receive more of the list. The resumptionToken may have
		# an argument specifying the complete list size (i.e. how many article titles appear in the complete list).
		list_size = selector.xpath('//oai:resumptionToken/@completeListSize', namespaces=ns).get()
		if list_size is not None:
			print('There is data from %s articles to be collected' % list_size)
			list_size = int(list_size)

		# We shall use a loop to obtain the data from the full list and we keep track of how many articles we have
		# collected data from along the way
		counter = 0
		while True:
			if counter != 0:
				try:
					page = requests.get(url)
					# We raise the stored HTTPError if one occurred.
					page.raise_for_status()

				except requests.HTTPError as err:
					if err.response.status_code == 503:
						# The text of err.response tells us how long to wait before retrying.
						error_selector = Selector(text=err.response.text, type='html')
						text = error_selector.xpath('//h1/text()').get()
						# text is of the form Retry after n seconds. We find n:
						timeout = int(text.split()[-2])
						print('Got 503 will retry to connect in %d seconds' % timeout)
						time.sleep(timeout)
						continue
					else:
						raise
			if counter != 0:
				selector = Selector(text=page.text, type='xml')
			# We now collect and append data (titles, date of creation and categories) to our database
			titles = selector.xpath('//arXiv:title/text()', namespaces=ns).getall()
			created = selector.xpath('//arXiv:created/text()', namespaces=ns).getall()
			categories = selector.xpath('//arXiv:categories/text()', namespaces=ns).getall()
			temp_df = pd.DataFrame({'Titles': titles, 'Created': created, 'Categories': categories})
			# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
			self.df = pd.concat([self.df, temp_df], ignore_index=True)
			counter += len(titles)
			# If we can find a resumptionToken then there is still more data to be gathered.
			resumption_Token = selector.xpath('//oai:resumptionToken/text()', namespaces=ns).get()
			if resumption_Token is not None:
				# We (try to) read in the page again but this time with the last received resumptionToken as an argument
				url = self.base_url + '&' + 'resumptionToken=' + resumption_Token
				print('We have so far collected data from %d articles' % counter)
				# completeListSize is optional, so list_size may be None
				if list_size is not None:
					print('There are still %d more articles to harvest data from' % (list_size - counter))

			else:
				print('All the data has been collected from the %d articles requested' % counter)
				break

	def append_today(self):
		"""
		Fetches and appends data from today's arXiv papers to our database.

		"""
		# We find today's date using the datetime package
		today = date.today().isoformat()
		if self.transformed:
			# self.df has been transformed and will therefore not have the categories column
			temp_df = self.df.copy()
			self.df = pd.DataFrame(columns=('Titles', 'Created', 'Categories'))
			self.fetch(start_date=today, end_date=today)
			self.transform_spread_labels()
			self.df = pd.concat([temp_df, self.df], ignore_index=True)
		else:
			self.fetch(start_date=today, end_date=today)

	def fetch_categories(self):
		"""
		Updates the categories associated to our discipline (science) of interest.

		Due to how arXiv describes subject classes it takes some work to implement this method
		for all disciplines, and we have therefore so far only implemented it in the case science=math

		"""

		if self.science != 'math':
			raise NotImplementedError('method fetch_categories has (so far) only been implemented for science=math')

		print('fetching categories')

		url = 'https://arxiv.org/archive/math'
		page = requests.get(url)
		selector = Selector(text=page.text, type='html')
		# We use Xpath to create a list of categories found on the web page.
		categories = selector.xpath('//li/b/text()').getall()
		# We remove the few items with the <li><b> tags that are not categories.
		categories = [entry for entry in categories if entry.startswith('math')]
		# Now we update our dictionary (self.categories_in_science).
		for entry in categories:
			separator = entry.find('-')
			category = entry[:separator - 1]
			name = entry[separator + 2:]
			self.categories_in_science[name] = category

	def export_data_to_csv(self, filename='arXiv_data.csv'):
		"""
		Export harvested data to csv.

		:param filename: The name of the csv file to be created.

		:type filename: str
		"""
		self.df.to_csv(filename, index_label=False, index=False)

	def export_categories_to_csv(self):
		"""
		Create a csv file containing the category tags and the corresponding names.
		:return: None
		"""
		temp_df = pd.DataFrame.from_dict(self.categories_in_science, orient='index')
		temp_df.to_csv('categories.csv', index_label=False)

	def load_data_from_csv(self, filename, transformed):
		"""
		Loads previously harvested data.

		warning:: This method replaces our current database.

		:param filename: The csv file to load data from.

		:param transformed: If the data to be loaded has previously been transformed by self.transform_spread_labels
		then transformed should be set to True. In this case the method sets self.transformed to True.


		:type filename: str

		:type transformed: bool



		"""
		self.df = pd.read_csv(filename)
		if transformed:
			self.transformed = True

	def transform_spread_labels(self):
		"""
		Spreads out the content of the categories column to a column for each category in self.science.

		self.df is transformed to a pandas dataframe where the categories column is replaced with one column for every
		category in self.science. Categories not belonging to self.science will no longer be visible.
		This method changes the status of self.transformed to True.

		"""
		if self.transformed:
			print('self.df has already been transformed')
			return

		# Create a temporary pd.DataFrame with one column per category and all entries set to 0
		category_tags = list(self.categories_in_science.values())
		temp_df = pd.DataFrame(np.zeros((self.df.shape[0], len(category_tags))), columns=category_tags)
		# Add all the categories as new columns filled with 0 entries to self.df
		self.df = pd.concat([self.df, temp_df], axis=1)

		# For each row and each category we change the value in the category column if this category is
		# contained in the Categories column of that row.
		for i, row in self.df.iterrows():
			for category in category_tags:
				if category in self.df.at[i, 'Categories']:
					self.df.at[i, category] = 1

		# Finally we remove the Categories column of self.df
		del self.df['Categories']
		self.transformed = True

	def return_titles_categories(self):
		"""
		Returns a pandas dataframe of harvested titles and categories.

		:return: A copy of self.df with the Created column removed and no duplicate titles.

		:rtype: pandas DataFrame
		"""
		return_df = self.df.copy()
		del return_df['Created']
		# drop_duplicates returns a new dataframe, so we must keep the result
		return_df = return_df.drop_duplicates(subset='Titles')
		return return_df
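
To make the flow control handled in fetch easier to follow, here is a minimal offline sketch of how a resumptionToken and its completeListSize attribute can be read from an OAI-PMH response. It uses xml.etree.ElementTree from the standard library instead of Parsel, and the XML snippet, the token value and the helper name read_resumption_token are made up for illustration; only the namespace URL is taken from the class above.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH responses (same URL as in ArXivFetcher.fetch)
NS = {'oai': 'http://www.openarchives.org/OAI/2.0/'}

# Hypothetical, heavily shortened response; a real one also contains the records
SAMPLE_RESPONSE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    '<ListRecords>'
    '<resumptionToken cursor="0" completeListSize="2500">example-token|1001</resumptionToken>'
    '</ListRecords>'
    '</OAI-PMH>'
)


def read_resumption_token(xml_text):
    """Return (token, complete_list_size); either entry is None if it is absent."""
    root = ET.fromstring(xml_text)
    element = root.find('.//oai:resumptionToken', NS)
    if element is None:
        return None, None
    # An empty resumptionToken element signals that the list is complete
    token = element.text.strip() if element.text and element.text.strip() else None
    size = element.get('completeListSize')
    size = int(size) if size is not None else None
    return token, size


token, size = read_resumption_token(SAMPLE_RESPONSE)
print(token, size)
```

As long as the token is not None, the next request would pass it on via the resumptionToken argument, just as the loop in fetch does.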

To collect our dataset we will first use our class to collect titles and associated categories of all maths papers modified on the arXiv between 27.04.2009 and 26.04.2019 and store this data as a csv file:

harvester = ArXivFetcher('math')
harvester.fetch(start_date='2009-04-27', end_date='2019-04-26')
harvester.transform_spread_labels()
data_set = harvester.return_titles_categories()
data_set.to_csv('titles_and_categories.csv', index_label=False, index=False)

Now we write a script that reads the titles_and_categories.csv file into a pandas dataframe, extracts the titles labelled as algebraic geometry or number theory (or both) into a new pandas dataframe and performs a brief analysis of our new dataset before storing it to csv.

"""
This script extracts the titles labeled either math.AG (Algebraic geometry) or math.NT (Number theory) from the file
titles_and_categories.csv and stores this data in a pandas dataframe that we later save as AGNT. We also briefly analyse
the content of our dataframe.
"""
import pandas as pd

df = pd.read_csv('titles_and_categories.csv')
# Extract the rows that were labeled with either math.AG (Algebraic geometry) or math.NT (Number theory)
AGNT = df[(df['math.AG'].values == 1) | (df['math.NT'].values == 1)]
# Remove all columns except 'Titles', 'math.AG', and 'math.NT'
AGNT = AGNT.drop(columns=[c for c in list(AGNT.columns) if c not in ['Titles', 'math.AG', 'math.NT']])

# Find the rows of AGNT where math.AG (Algebraic geometry) is labeled.
AG = AGNT[(AGNT['math.AG'].values == 1)]
# Find the rows of AGNT where math.NT (Number theory) is labeled.
NT = AGNT[(AGNT['math.NT'].values == 1)]

# The rows where both math.AG and math.NT are labeled.
AG_and_NT = AGNT[(AGNT['math.AG'].values == 1) & (AGNT['math.NT'].values == 1)]
# The rows where math.AG is labeled but not math.NT
AG_not_NT = AGNT[(AGNT['math.NT'].values == 0)]
# The rows where math.NT is labeled but not math.AG
NT_not_AG = AGNT[(AGNT['math.AG'].values == 0)]

# Let us see how many items each of these pandas dataframes contain:

total = int(AGNT.shape[0])

num_AG = int(AG.shape[0])

num_NT = int(NT.shape[0])

num_AG_and_NT = int(AG_and_NT.shape[0])

num_AG_not_NT = int(AG_not_NT.shape[0])

num_NT_not_AG = int(NT_not_AG.shape[0])
print('Number of articles tagged either AG or NT (or both): %d' % total)
print('number of articles tagged AG: %d' % num_AG)
print('number of articles tagged NT: %d' % num_NT)
print('number of articles tagged AG and NT: %d' % num_AG_and_NT)
print('number of articles tagged AG but not NT: %d' % num_AG_not_NT)
print('number of articles tagged NT but not AG: %d' % num_NT_not_AG)

# We compute the corresponding probabilities:
pr_AG = num_AG/total
pr_NT = num_NT/total
pr_AG_and_NT = num_AG_and_NT/total 
pr_AG_not_NT = num_AG_not_NT/total
pr_NT_not_AG = num_NT_not_AG/total
pr_AG_given_NT = num_AG_and_NT/num_NT
pr_not_AG_given_NT = 1 - pr_AG_given_NT
pr_NT_given_AG = num_AG_and_NT/num_AG
pr_not_NT_given_AG = 1 - pr_NT_given_AG
print('probability of math.AG = %f' % pr_AG)
print('probability of math.NT = %f' % pr_NT)
print('product of these probabilities is: %f' % (pr_AG * pr_NT))
print('probability of math.AG and math.NT is: %f' % pr_AG_and_NT)

print('probability of AG but not NT: %f' % pr_AG_not_NT)
print('probability of NT but not AG: %f' % pr_NT_not_AG)
print('probability of AG given NT: %f' % pr_AG_given_NT)
print('probability of NT given AG: %f' % pr_NT_given_AG)
print('probability of not NT given AG: %f' % pr_not_NT_given_AG)
print('probability of not AG given NT: %f' % pr_not_AG_given_NT)


AGNT.to_csv('AGNT.csv', index_label=False, index=False)

Running the script gives the following output:

Number of articles tagged either AG or NT (or both): 45408
number of articles tagged AG: 26793
number of articles tagged NT: 21757
number of articles tagged AG and NT: 3142
number of articles tagged AG but not NT: 23651
number of articles tagged NT but not AG: 18615
probability of math.AG = 0.590050
probability of math.NT = 0.479145
product of these probabilities is: 0.282719
probability of math.AG and math.NT is: 0.069195
probability of AG but not NT: 0.520855
probability of NT but not AG: 0.409950
probability of AG given NT: 0.144413
probability of NT given AG: 0.117269
probability of not NT given AG: 0.882731
probability of not AG given NT: 0.855587

Hence our dataset consists of 45408 samples, of which 26793 are labelled as algebraic geometry and 21757 as number theory, while only 3142 carry both labels. We also notice that the labels are dependent: if they were independent, the probability of a paper carrying both labels would equal the product of the individual probabilities (0.282719), whereas the observed value is only 0.069195.
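
The dependency can be read off directly from the counts above. The following short check recomputes it from the numbers printed by the script; the variable names are our own, and the comparison only concerns papers carrying at least one of the two labels.

```python
# Counts taken from the output of the script above
total = 45408
num_AG = 26793
num_NT = 21757
num_AG_and_NT = 3142

pr_AG = num_AG / total
pr_NT = num_NT / total
pr_joint = num_AG_and_NT / total

# If the labels were independent, the joint probability would equal this product
expected_if_independent = pr_AG * pr_NT

print('observed joint probability: %f' % pr_joint)
print('product of the marginals:   %f' % expected_if_independent)
print('ratio observed / expected:  %f' % (pr_joint / expected_if_independent))
```

A ratio well below 1 tells us that, within this dataset, a paper tagged with one of the labels is much less likely to carry the other one than independence would suggest.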

Vectorisation of the titles

In order to train a classifier to predict labels based on the titles of papers we need to transform these titles into vectors containing numerical values. A typical way to transform a set of documents into vectors is to keep a dictionary of the vocabulary of the training set and use that to map the words into indices. Another method is to use the so-called hashing trick, where we let a hashing function convert the words into integers for us. In this way we do not need to keep updating the vocabulary every time we want to train our model on new training sets, but we may have collisions, that is, different words mapping to the same hashing value. If we increase the dimension of the hashing space (meaning we increase the number of features of our (transformed) samples) then collisions are less likely to occur.
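
To illustrate the hashing trick and how the dimension of the hashing space affects collisions, here is a small self-contained sketch. It uses MD5 from the standard library purely for convenience (the class presented further down uses MurmurHash3 instead), and the helper names hashed_index and count_collisions as well as the word list are made up for this illustration.

```python
import hashlib


def hashed_index(word, n):
    """Map a word to an index in range(n) using a deterministic hash."""
    digest = hashlib.md5(word.encode('utf-8')).hexdigest()
    return int(digest, 16) % n


def count_collisions(words, n):
    """How many distinct words fail to get an index of their own."""
    distinct_words = set(words)
    indices = {hashed_index(word, n) for word in distinct_words}
    return len(distinct_words) - len(indices)


words = ['elliptic', 'curves', 'moduli', 'spaces', 'prime', 'numbers',
         'modular', 'forms', 'sheaves', 'varieties', 'zeta', 'functions']

# With only 4 indices available, at least 8 of the 12 words must collide (pigeonhole principle)
print(count_collisions(words, n=4))
# With 2**21 indices, collisions among 12 words are very unlikely
print(count_collisions(words, n=2**21))
```

No vocabulary is stored anywhere: a word seen for the first time next year is mapped to an index in exactly the same way.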

Since the arXiv is updated regularly, and we may want to train our CC on new samples in the future, we decide to go with the hashing trick when converting our titles to vectors. In fact we will write a class MathHashingVectorizer tailored for vectorisation of titles of math papers. The class keeps a list stop containing words from mathematics, English and French that are so standard (across all areas of mathematics) that they are unlikely to give any useful information about sample labels. The vectorize method of our class removes the words in stop from all the titles it is fed and transforms the result into a sparse matrix where the rows correspond to the titles. The class also comes with some stemming functionality, but we will not apply this in our example.

Here is the code:

import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
import mmh3
from nltk.stem.porter import PorterStemmer
# stopwords can be downloaded by importing nltk to python and then executing the command nltk.download('stopwords')
from nltk.corpus import stopwords
# We will use itertools to efficiently flatten lists of lists
import itertools
# Due to the huge amount of data we will need sparse matrices
from scipy.sparse import csr_matrix


class MathHashingVectorizer(object):
    """Class for transformation of (mathematical) text to vectors with numerical entries.

        The typical use of this class is to transform texts by first removing certain stopwords stored in (self.stop)
        and then turning the result into sparse vectors that can be used as samples when training a classifier.

        Methods:
            __init__: Constructor method.

            seeded_mmh3: Seeded hashing function based on Murmurhash3.

            update_stopwords: Append stopwords.

            remove_from_text: Removes all stopwords from the input text.

            porter_stem: Applies Porter stemming to the input text.

            stem: Applies Porter stem to all text not in math mode.

            encode: Returns an ndarray where entries correspond to hashing values of the input text.

            vectorize_encoded: Transforms a pandas Series of encoded values to a sparse matrix.

            vectorize: Transforms a pandas Series of texts to a sparse matrix suitable for training a classifier.
    """

    def __init__(self, random_seed=123, n=2**21):
        """
        Constructor method.

        :param random_seed: set a random seed for reproducibility.
        :param n: The dimension of the hashing space, or in other words the number of features.

        :type random_seed: int
        :type n: int
        """
        self.random_seed = random_seed
        self.n = n
        self.math_stopwords = ['proof', 'theorem', 'proposition', 'definition', 'lemma', 'counterexample']
        self.math_stopwords += ['counterexamples', 'conjecture', 'conjectures', 'proofs', 'definitions']
        self.math_stopwords += ['theorems', 'propositions', 'lemmas']
        self.math_stopwords += ['Generalization', 'result', 'generalization', 'appendix', 'corollary']
        self.math_stopwords += ['generalizations', 'generalisation', 'generalisations']
        self.stop = stopwords.words('english') + stopwords.words('french') + self.math_stopwords
        self.porter = PorterStemmer()

    def seeded_mmh3(self, x):
        """
        Seeded non-cryptographic hash function found in the mmh3 library (a python wrapper for Murmurhash 3).
        :param x: text
        :return: hash value

        :type x: str
        :rtype: int
        """
        return mmh3.hash(x, seed=self.random_seed)

    def update_stopwords(self, x):
        """
        Adds stopwords.
        :param x: List of strings to be appended to self.stop

        :type x: list
        """
        self.stop += x

    def remove_from_text(self, text):
        """
        Removes all stopwords from the given string.
        :param text: Text
        :return: Text with all words contained in self.stop removed.

        :type text: str
        :rtype: str
        """
        text = ' '.join(i for i in text.split() if i.lower() not in self.stop)
        return text

    def porter_stem(self, text):
        """
        stems the words appearing in the input text using PorterStem()

        :param text: Text to be stemmed.
        :return: Stemmed text.

        :type text: str
        :rtype: str
        """
        stem_split = [self.porter.stem(word) for word in text.split()]
        return ' '.join(word for word in stem_split)

    def stem(self, text):
        """
        This method applies PorterStemmer to all text not in math_mode (that is not wrapped in $ symbols).

        ::warning: This method has not been tested yet.

        :param text: Text to be transformed
        :return: Stemmed text

        :type text: str
        :rtype: str
        """

        print('This is an early version of stem and we advise caution')
        if text == '':
            return ''

        if text.count('$') % 2 != 0:
            print('text contains an odd number of $ symbols and we will treat it with porter_stem')
            return self.porter_stem(text)

        txt = text
        transformed_text = ''
        while len(txt) > 0:
            first_occurrence = txt.find('$')
            if first_occurrence == -1:
                # No math mode left: stem the remaining text and stop (otherwise the loop never ends).
                transformed_text += self.porter_stem(txt)
                break
            second_occurrence = txt.find('$', first_occurrence + 1)
            if first_occurrence != 0:
                transformed_text += self.porter_stem(txt[: first_occurrence]) + ' ' + \
                                    txt[first_occurrence: second_occurrence + 1] + ' '
                txt = txt[second_occurrence + 1:]
            else:
                transformed_text += txt[:second_occurrence + 1] + ' '
                txt = txt[second_occurrence + 1:]

        return transformed_text

    def encode(self, text, max_words=40, hash_function=None, stemming=False):
        """
        Removes stopwords from a text and encodes the result into an array of integers using a hashing function.

        :param text: Text to be encoded.

        :param max_words: Maximum number of words in the string to be considered.

        :param hash_function: optional, default = None. A hashing function. If no hashing function is provided
        self.seeded_mmh3 will be applied.

        :param stemming: optional, default = False. If set to True we stem the words in the text (using self.stem)
        before encoding.

        :return: ndarray of hashing values.

        :type text: str

        :type max_words: int

        :type hash_function: function

        :type stemming: bool

        :rtype: ndarray, shape = (max_words,)
        """

        text = self.remove_from_text(text)
        if stemming:
            text = self.stem(text)

        filters = '!"$\n \t'

        if hash_function is None:
            hash_function = self.seeded_mmh3

        pre_vec = keras.preprocessing.text.hashing_trick(
            text=text, n=self.n, hash_function=hash_function, lower=False, filters=filters)

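        # Convert to an ndarray so that the return type matches the docstring in every case
        vec = np.array(pre_vec[:max_words], dtype=int)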

        if len(vec) < max_words:
            m = max_words - len(vec)
            extension = ((self.n) * np.ones(m)).astype(int)
            vec = np.concatenate((vec, extension))

        return vec

    def vectorize_encoded(self, df, max_words=40, mode='binary'):
        """
        Transforms a pandas series of ndarray's to a csr-matrix.

         The non-zero entries of each row in the returned sparse matrix corresponds to the values in the corresponding
         ndarray of hashing values.

        :param df: Series where each entry is an ndarray of shape (max_words,) .

        :param max_words: optional, default = 40. Maximum number of words to consider.

        :param mode: optional, default = 'binary'. If set to 'binary' a non-zero entry is set to 1 if the index of the
        column (in the given row) appeared in the corresponding ndarray . If mode is set to 'count' then an entry d in
        a column means that the index of this column appeared d times in the corresponding ndarray.

        :return: Sparse matrix

        :type df: pandas Series

        :type max_words: int

        :type mode: str

        :rtype: csr_matrix, shape = (df.shape[0], self.n)
        """
        # We will construct a csr_matrix from (data, row_ind, col_ind)
        s = df.shape[0]
        columns_per_row = df.apply(lambda x: np.unique(x))
        columns_per_row = columns_per_row.apply(lambda x: x.astype(int))
        columns_per_row = columns_per_row.apply(lambda x: list(x))
        columns_lists = columns_per_row.tolist()
        cols = list(itertools.chain.from_iterable(columns_lists))
        cols = np.array(cols)
        # our col_ind will be cols. Let us now create our row_ind
        lengths = list(columns_per_row.apply(lambda x: len(x)).values)
        rows = []
        for i in range(s):
            v = lengths[i] * [i]
            rows += v
        rows = np.array(rows)
        # rows is our row_ind. We now find the data to insert into our sparse matrix
        if mode == 'count':
            def counts(x):
                uniq, ret_counts = np.unique(x, return_counts=True)
                return ret_counts

            counted = df.apply(counts)
            counted = counted.apply(lambda x: list(x)).tolist()
            data = list(itertools.chain.from_iterable(counted))
            data = np.array(data)

        else:
            data = np.ones(cols.shape[0])

        X = csr_matrix((data, (rows, cols)), shape=(df.shape[0], self.n + 1))
        return X[:, : -1]

    def vectorize(self, df, max_words=40, mode='binary', stemming=False):
        """
        Transforms a pandas series of strings to a csr-matrix with integer entries suitable for training a classifier.

        The transformation is carried out by first applying self.encode to each entry of the series and then applying
        self.vectorize_encoded to the result.

        :param df: Series of (mathematical) text to be transformed.

        :param max_words: optional, default = 40. Maximum number of words to consider in each entry of the Series

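        :param mode: 'binary' or 'count'. See the documentation for self.vectorize_encoded.

        :param stemming: optional, default = False. If set to True the texts are stemmed (using self.stem) before encoding.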

        :return: Sparse matrix of vectorized samples.

        :type df: pandas Series

        :type max_words: int

        :type mode: str

        :rtype: csr_matrix, shape = (df.shape[0], self.n)
        """
        encoded_df = df.apply(lambda x: self.encode(x, max_words=max_words, stemming=stemming))
        return self.vectorize_encoded(df=encoded_df, max_words=max_words, mode=mode)

Let’s now vectorise the dataset. While we are at it we will also split the data into a training and a test set where the former contains 80% of the samples.

""" 
Script for converting the titles in AGNT.csv to sparse vectors with 2**21 features that can be used for training our
chain classifier. In this process we will construct a training set and a test set that we save for later use.
"""

import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
import scipy.sparse 
from math_vectorizer import MathHashingVectorizer
# We set the random seed for reproducibility
np.random.seed(1)
tf.set_random_seed(1)

data = pd.read_csv('AGNT.csv')

# Extract 80% of the data frame at random for the training sample and the 
# remaining 20% for testing.

random_training_data = data.sample(frac=0.8, random_state=1)
random_test_data = data.drop(index=random_training_data.index)

# Now we start vectorizing our training data 
vectorizer = MathHashingVectorizer(random_seed=1, n=2**21)
X_train = vectorizer.vectorize(random_training_data['Titles'], max_words=15)
# Create an ndarray with the corresponding training labels 
y_train = random_training_data[random_training_data.columns[1:]].values

# Similarly we construct the data set for testing
X_test = vectorizer.vectorize(random_test_data['Titles'], max_words=15)
y_test = random_test_data[random_test_data.columns[1:]].values 

# Finally we save our work

# We first save the data frames 

store = pd.HDFStore('training_and_test_df.h5')
store['random_training_data'] = random_training_data 
store['random_test_data'] = random_test_data
# Close the store so that everything is flushed to disk
store.close()

# Now we store the sparse matrices of training and test features respectively. 

scipy.sparse.save_npz('X_train.npz', X_train)
scipy.sparse.save_npz('X_test.npz', X_test)

# The training and test labels are numpy ndarrays and are also easily saved. 

np.save('y_train.npy', y_train)
np.save('y_test.npy', y_test)

Training our classifier chain

We are now ready to assemble everything and train a CC to predict algebraic geometry and/or number theory based on a paper’s title. In our example we will let the first classifier in the chain have one hidden layer with 50 units, and we will set dropout to $0.1$ as an attempt to reduce overfitting. Moreover our optimizer will be stochastic gradient descent with a learning rate set to $0.1$ and we use binary crossentropy as our loss function. We let the ClassifierChain constructor infer the configuration of the second classifier for us. Throughout the training phase of the CC we will use the same sample weights which we get from ClassifierChain.compute_sample_weights where we shall set the debalancing parameter to $1.4$.

Here is the code:

import numpy as np
import scipy.sparse 
import tensorflow as tf
import tensorflow.keras as keras
from classifier_chains import ClassifierChain

np.random.seed(0)
tf.set_random_seed(0)

X_train = scipy.sparse.load_npz('X_train.npz')
X_test = scipy.sparse.load_npz('X_test.npz')

y_train = np.load('y_train.npy')

model = keras.models.Sequential()
model.add(keras.layers.Dense(units=50, input_dim=X_train.shape[1], activation='relu'))
model.add(keras.layers.Dropout(0.1))
model.add(keras.layers.Dense(units=2, input_dim=50, activation='softmax'))

sgd_optimizer = keras.optimizers.SGD(lr=0.1, decay=1e-15, momentum=.9)

model.compile(optimizer=sgd_optimizer, loss='binary_crossentropy')

optimizers = [sgd_optimizer]
losses = ['binary_crossentropy']

cc = ClassifierChain(classifier=model, num_labels=2, name='AG_NT_multi_class_weights', optimizers=optimizers, losses=losses, create_missing=True)

sample_weights = cc.compute_sample_weights(y_train, debalancing=1.4)

predefined_sample_weights = [sample_weights, sample_weights]

cc.fit(X=X_train, y=y_train, epochs=15, batch_size=64, predefined_weights=predefined_sample_weights)

preds_train  = cc.predict(X_train)

preds_test = cc.predict(X_test)

np.save('preds_train_AG_NT_multi_class_weights.npy', preds_train)

np.save('preds_test_AG_NT_multi_class_weights.npy', preds_test)

For comparison we train some more classifiers with different weight settings.

cc = ClassifierChain(classifier=model, num_labels=2, name='AG_NT_chain', optimizers=optimizers, losses=losses, create_missing=True)

debalancing = [0, 1.4] 
cc.fit(X=X_train, y=y_train, epochs=15, batch_size=64, weights_mode='chain', debalancing=debalancing)

preds_train_AG_NT_chain = cc.predict(X_train)

preds_test_AG_NT_chain = cc.predict(X_test)

np.save('preds_train_AG_NT_chain.npy', preds_train_AG_NT_chain)

np.save('preds_test_AG_NT_chain.npy', preds_test_AG_NT_chain)

cc = ClassifierChain(classifier=model, num_labels=2, name='AG_NT_no_sample_weights', optimizers=optimizers, losses=losses, create_missing=True)

cc.fit(X=X_train, y=y_train, epochs=15, batch_size=64)

preds_train_AG_NT_no_sample_weights = cc.predict(X_train)

preds_test_AG_NT_no_sample_weights = cc.predict(X_test)

np.save('preds_train_AG_NT_no_sample_weights.npy', preds_train_AG_NT_no_sample_weights)

np.save('preds_test_AG_NT_no_sample_weights.npy', preds_test_AG_NT_no_sample_weights)

and finally

cc = ClassifierChain(classifier=model, num_labels=2, name='AG_NT_no_weights_automatic', optimizers=optimizers, losses=losses, create_missing=True)

sample_weights = np.ones(y_train.shape[0])

predefined_sample_weights = [sample_weights]

cc.fit(X=X_train, y=y_train, epochs=15, batch_size=64, weights_mode='chain', predefined_weights=predefined_sample_weights, debalancing=[0, 1.4])

preds_train  = cc.predict(X_train)

preds_test = cc.predict(X_test)

np.save('preds_train_AG_NT_no_weights_automatic.npy', preds_train)

np.save('preds_test_AG_NT_no_weights_automatic.npy', preds_test)

Evaluation

Now that we have trained a few classifier chains, we should evaluate their predictions, but let us first briefly discuss how.

Evaluation metrics for multi-label classifiers

A natural, but very strict, way to evaluate the performance of a multi-label classifier is the following: For a sample $s$ in our dataset $D$ we let $I(s)$ be $1$ if the classifier predicted the correct multi-label and $0$ otherwise. If $|D|$ denotes the number of samples in the dataset, then the strict accuracy measure is given by the formula

\[\begin{equation} \sum_{s \in D} I(s) / |D| \end{equation}\]

This gives a number between 0 and 1, where 0 means that our classifier did not predict the complete multi-label correctly for any sample, and 1 means that all the multi-labels were correctly predicted.

It is however natural to say that a classifier that predicts some correct labels for a sample is better than one that doesn’t get a single one right. The strict accuracy measure does not take this into account, and it is perhaps a good idea to (also) use another evaluation measure. If we let $J(s)$ denote the number of correctly predicted labels for the sample $s$, and we let $K(s)$ denote the number of false positive and false negative predictions for $s$, then the accuracy measure is given by the formula:

\[\begin{equation} \frac{\sum_{s \in D} J(s)/(K(s) + \text{number of labels})}{|D|} \end{equation}\]
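
As a sanity check of this formula, the following sketch computes $J(s)$, $K(s)$ and the resulting accuracy on made-up toy data (the array names are illustrative assumptions, not part of the project code). With 0/1 predictions every label is either correct or a false positive or false negative, so $J(s) + K(s)$ always equals the number of labels.

```python
import numpy as np

# Toy data (hypothetical): 3 samples, 2 labels.
y_true = np.array([[1, 0], [1, 1], [0, 1]])
y_pred = np.array([[1, 0], [1, 0], [1, 0]])

num_labels = y_true.shape[1]

# J(s): number of correctly predicted labels per sample.
J = np.sum(y_pred == y_true, axis=1)

# K(s): false positives (predicted 1, true 0) plus false negatives (predicted 0, true 1).
false_pos = np.sum((y_pred == 1) & (y_true == 0), axis=1)
false_neg = np.sum((y_pred == 0) & (y_true == 1), axis=1)
K = false_pos + false_neg

# Average of J(s) / (K(s) + number of labels) over all samples.
accuracy = np.mean(J / (K + num_labels))
print(J, K, accuracy)
```

Here the three samples contribute $2/2$, $1/3$ and $0/4$ respectively, so the accuracy is $4/9$, whereas the strict accuracy of the same predictions would be $1/3$.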

Class for multi-label evaluation metrics

Let us write a class with various tools for evaluating the predictions made by a multi-label classifier. The class will in particular have a method called comparison_table that makes it convenient for us to compare the predictions made by several different classifiers.

Our code is presented here:

import numpy as np
import pandas as pd


class MultilabelPredictionEvaluater(object):
    """
    Class for evaluation and comparison of multi-label classifiers.

    Most methods of this class take an ndarray of predictions as input and compare them (in some way) to the
    correct multi-labels which the constructor method stores as self.y.

    Methods:
        __init__: Constructor method.

        correct_predictions: Gives an ndarray indicating correctness of the predictions.

        correct_predictions_per_sample: Returns an ndarray indicating the number of correct predictions per sample.

        correct_predictions_per_label: Returns an ndarray indicating how many times each label was correctly predicted.

        num_correct_predictions: Returns the total number of correct predictions.

        strict_accuracy: Strict evaluation measure for multi-label classifiers.

        false_positives: Returns an ndarray indicating the presence of false positives among the predictions.

        false_positives_per_sample: Indicates the number of false positives per sample.

        false_positives_per_label: Returns an ndarray indicating the number of false positives per label.

        num_false_positives: Gives the total number of false positive predictions.

        false_negatives: Returns an ndarray indicating presence of false negatives among the predictions.

        false_negatives_per_sample: Indicates how many false negative predictions there are per sample.

        false_negatives_per_label: Indicates how many false negative predictions were made for each label.

        num_false_negatives: The total number of false negative predictions.

        accuracy: Standard evaluation measure for multi-label classifiers.

        preds_per_label: Returns the number of times each label was predicted.

        comparison_table: Returns a table comparing predictions performed by different classifiers.
    """
    def __init__(self, y):
        """
        Constructor method.
        :param y: Correct multi-labels.

        :type y: ndarray, shape = (num_samples, num_labels).
        """
        self.y = y

    def correct_predictions(self, x):
        """
        Returns an ndarray where entry i,j is True if the corresponding prediction was correct.

        :param x: Predictions

        :return: ndarray indicating correct predictions.

        :type x: ndarray, shape = (num_samples, num_labels) = self.y.shape

        :rtype: ndarray, shape = (num_samples, num_labels) = self.y.shape = x.shape
        """
        comparison = np.equal(x, self.y)
        return comparison

    def correct_predictions_per_sample(self, x):
        """
        Returns an ndarray where the i'th entry is the number of correct predictions in sample i

        :param x: Predictions.

        :return: Array indicating the number of correct predictions per sample.

        :type x: ndarray, shape = (num_samples, num_labels) = self.y.shape

        :rtype: ndarray, shape = (num_samples,)
        """
        return np.sum(self.correct_predictions(x), axis=1)

    def correct_predictions_per_label(self, x):
        """
        Returns an ndarray where the i'th entry is the number of times the corresponding label was correctly predicted.

        :param x: Predictions.

        :return: ndarray indicating the number of correct predictions per label.

        :type x: ndarray, shape = (num_samples, num_labels) = self.y.shape

        :rtype: ndarray, shape = (num_labels,)
        """
        return np.sum(self.correct_predictions(x), axis=0)

    def num_correct_predictions(self, x):
        """
        Returns the total number of all correct predictions.

        :param x: Predictions.

        :return: The total number of correct predictions.

        :type x: ndarray, shape = (num_samples, num_labels) = self.y.shape

        :rtype: int
        """
        return int(np.sum(self.correct_predictions_per_sample(x), axis=0))

    def strict_accuracy(self, x):
        """
        A strict evaluation measure. Only the samples where every label was correctly predicted contribute to the score.

        :param x: Predictions.

        :return: The sum of all samples that were labeled correctly divided by the total number of samples.

        :type x: ndarray, shape = (num_samples, num_labels)

        :rtype: float
        """
        comparison = self.correct_predictions(x).astype(int)
        correct = np.ones(self.y.shape[0])
        for i in range(self.y.shape[1]):
            correct *= comparison[:, i]

        return np.sum(correct, axis=0) / self.y.shape[0]

    def false_positives(self, x):
        """
        Returns an ndarray where the i,j'th entry is True if the corresponding prediction was a false positive.

        :param x: Predictions.

        :return: ndarray indicating false positives.

        :type x: ndarray, shape = (num_samples, num_labels) = self.y.shape

        :rtype: ndarray, shape = (num_samples, num_labels) = x.shape = self.y.shape
        """
        ones = np.ones(shape=self.y.shape)
        fp = np.equal((x - self.y), ones)
        return fp

    def false_positives_per_sample(self, x):
        """
        Returns an ndarray where the i'th entry is the number of false positives predicted for sample i.

        :param x: Predictions.

        :return: ndarray indicating the number of false positives per sample.

        :type x: ndarray, shape = (num_samples, num_labels) = self.y.shape

        :rtype: ndarray, shape = (num_samples,)
        """
        return np.sum(self.false_positives(x), axis=1)

    def false_positives_per_label(self, x):
        """
        Returns an ndarray indicating the number of false positives per label.

        The i'th entry of the returned ndarray is the number of times the corresponding label was a false positive
        among the predictions.

        :param x: Predictions.

        :return: ndarray indicating false positives per label.

        :type x: ndarray, shape = (num_samples, num_labels) = self.y.shape

        :rtype: ndarray, shape = (num_labels,)
        """
        return np.sum(self.false_positives(x), axis=0)

    def num_false_positives(self, x):
        """
        Returns the total number of all false positive predictions.

        :param x: Predictions.

        :return: Total number of false positive predictions.

        :type x: ndarray, shape = (num_samples, num_labels) = self.y.shape

        :rtype: int
        """
        return int(np.sum(self.false_positives_per_sample(x), axis=0))

    def false_negatives(self, x):
        """
        Returns an ndarray where the i,j'th entry is True if the corresponding prediction was a false negative.

        :param x: Predictions.

        :return: ndarray indicating false negatives.

        :type x: ndarray, shape = (num_samples, num_labels)

        :rtype: ndarray, shape = (num_samples, num_labels) = x.shape = self.y.shape
        """
        minus_ones = -1 * np.ones(shape=self.y.shape)

        fn = np.equal((x - self.y), minus_ones)

        return fn

    def false_negatives_per_sample(self, x):
        """
        Returns an ndarray where the i'th entry is the number of false negatives predicted for sample i.

        :param x: Predictions.

        :return: ndarray indicating the number of false negatives per sample.

        :type x: ndarray, shape = (num_samples, num_labels)

        :rtype: ndarray, shape = (num_samples,)
        """
        return np.sum(self.false_negatives(x), axis=1)

    def false_negatives_per_label(self, x):
        """
        Returns an ndarray indicating the number of false negatives per label.

        The i'th entry of the returned ndarray is the number of times the corresponding label was identified as
        a false negative among the predictions.

        :param x: Predictions.

        :return: ndarray indicating false negatives per label.

        :type x: ndarray, shape = (num_samples, num_labels) = self.y.shape

        :rtype: ndarray, shape = (num_labels,)
        """
        return np.sum(self.false_negatives(x), axis=0)

    def num_false_negatives(self, x):
        """
        Returns the total number of false negative predictions.

        :param x: Predictions.

        :return: The total number of false negatives among the predictions.

        :type x: ndarray, shape = (num_samples, num_labels) = self.y.shape

        :rtype: int
        """
        return int(np.sum(self.false_negatives_per_sample(x), axis=0))

    def accuracy(self, x, lenience=None):
        """
        Standard evaluation measure for multi-label classification problems.

        This evaluation measure can for instance be found in the paper 'Classifier chains for
        multi-label classification' by Jesse Read et al.

        :param x: Predictions.

        :param lenience: optional, default = None. If set to 'false positives' (resp. 'false negatives' )
        we are more lenient towards false positive (resp. false negative) predictions.

        :return: sum of terms between 0.0 and 1/num_samples where each sample with at least one correctly predicted
        label contributes to the sum.

        :type x: ndarray, shape = (num_samples, num_labels)

        :type lenience: str

        :rtype: float
        """

        correct = self.correct_predictions_per_sample(x)

        num_labels_vector = self.y.shape[1]*np.ones(self.y.shape[0])
        fp = self.false_positives_per_sample(x)
        fn = self.false_negatives_per_sample(x)

        denominator = num_labels_vector + fp + fn

        if lenience == 'false positives':
            denominator -= fp

        elif lenience == 'false negatives':
            denominator -= fn

        return np.sum(correct/denominator, axis=0) / self.y.shape[0]

    @staticmethod
    def preds_per_label(x):
        """
        Returns an ndarray where the i'th entry is the number of times the corresponding label was predicted.

        :param x: Predictions.

        :return: ndarray indicating predictions per label.

        :type x: ndarray, shape = (num_samples, num_labels)

        :rtype: ndarray, shape = (num_labels,)
        """
        return np.sum(x, axis=0)

    def comparison_table(self, predictions, labels):
        """
        Returns a table comparing predictions performed by different classifiers.

        More precisely the method creates a pandas DataFrame with columns: strict accuracy, accuracy, false positives,
        false negatives, most false positives and most false negatives. The i'th row corresponds to the i'th element
        in the input list of predictions.

        :param predictions: List of ndarrays of shape = (num_samples, num_labels)

        :param labels: List of length num_labels where every entry is a string.

        :return: comparison table

        :type predictions: list

        :type labels: list

        :rtype: pandas DataFrame
        """
        columns = ['strict accuracy', 'accuracy', 'false positives ', 'false negatives',
                   'most false positives', 'most false negatives']
        # We will fill in the values in the pandas DataFrame by applying functions to each column. Let us first
        # create a DataFrame of shape (len(predictions), len(columns)) where all entries of the i'th row
        # are the integer i.
        temp_list = list(np.arange(len(predictions)))
        temp_list = len(columns) * [temp_list]
        temp_array = np.array(temp_list).T
        temp_array = temp_array.astype(int)

        df = pd.DataFrame(temp_array, index=list(range(len(predictions))), columns=columns)

        # We now create the functions that are to be applied to each column of our DataFrame respectively.
        def f_1(x):
            return self.strict_accuracy(predictions[x])

        def f_2(x):
            return self.accuracy(predictions[x])

        def f_3(x):
            return self.num_false_positives(predictions[x])

        def f_4(x):
            return self.num_false_negatives(predictions[x])

        def f_5(x):
            return labels[int(np.argmax(self.false_positives_per_label(predictions[x])))]

        def f_6(x):
            return labels[int(np.argmax(self.false_negatives_per_label(predictions[x])))]

        functions_list = [f_1, f_2, f_3, f_4, f_5, f_6]
        column_function_dict = {key: value for (key, value) in zip(columns, functions_list)}

        # We iterate over the columns of df. Every entry of a column is the index of a prediction array, so for
        # each column we apply the corresponding function to these indices and overwrite the column with
        # the result.
        for column in columns:
            df[column] = df[column].apply(column_function_dict[column])

        return df

Before moving on to evaluating our classifiers, we quickly write a class that lets us easily extract certain subsets of our datasets. This is convenient because we want to know in particular how well our classifiers perform on the minority class, i.e. the subset of titles labelled both algebraic geometry and number theory.

Here is the code:


class MaskCreater(object):
    """
    Class for extracting several interesting sub-sets from a given data set.

    Methods:
        __init__: Constructor method.
        __call__: Creates a mask that can be used to extract the sub-set where certain labels (don't) appear.
    """
    def __init__(self, y):
        """
        Constructor method.

        :param y: Multi-labels
        :type y: ndarray, shape = (num_samples, num_labels).
        """
        self.y = y

    def __call__(self, col_ones=None, col_zeros=None):
        """
        Call method.

        This method constructs a mask that can be used to extract the subset of our data set of samples labeled with
        a given set of labels and not labeled with another given set of labels.

        :param col_ones: List of indices corresponding to the columns corresponding to the labels that should appear.

        :param col_zeros: List of indices corresponding to the columns corresponding to the labels that we do not
         want to appear.

        :return: Mask to extract the desired sub-data set.

        :type col_ones: list

        :type col_zeros: list

        :rtype: ndarray of bool, shape = (num_samples,)

        """
        ones = np.ones(shape=self.y.shape)
        zeros = np.zeros(shape=self.y.shape)
        comp_ones = np.equal(self.y, ones)
        comp_zeros = np.equal(self.y, zeros)

        one_mask = np.ones(self.y.shape[0]).astype(bool)

        if col_ones is None:
            col_ones = []

        for index in col_ones:
            one_mask *= comp_ones[:, index]

        zero_mask = np.ones(self.y.shape[0]).astype(bool)

        if col_zeros is None:
            col_zeros = []

        for index in col_zeros:
            zero_mask *= comp_zeros[:, index]

        return one_mask * zero_mask
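
To illustrate what such a mask does, here is a small self-contained sketch on made-up labels. It uses a hypothetical helper `make_mask` that mirrors the logic of `__call__` with `np.all` rather than reusing the class, so that it can be run on its own.

```python
import numpy as np


def make_mask(y, col_ones=(), col_zeros=()):
    """Boolean mask selecting the rows of y with 1 in col_ones and 0 in col_zeros."""
    has_ones = np.all(y[:, list(col_ones)] == 1, axis=1)
    has_zeros = np.all(y[:, list(col_zeros)] == 0, axis=1)
    return has_ones & has_zeros


# Toy multi-labels (hypothetical): column 0 = AG, column 1 = NT.
y = np.array([[1, 0], [0, 1], [1, 1], [1, 0]])

both = make_mask(y, col_ones=[0, 1])                 # AG and NT
ag_only = make_mask(y, col_ones=[0], col_zeros=[1])  # AG but not NT

print(y[both])     # rows labelled with both AG and NT
print(y[ag_only])  # rows labelled AG but not NT
```

The mask has one boolean entry per sample, so the same mask can be applied to the labels and to every array of predictions to restrict them to the same subset.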

Comparing our trained CC’s

We are now ready to compare the predictions of our CC’s. We first do this on the training set using the following script:


"""
Script for evaluating our trained CC's
"""


import numpy as np
import pandas as pd 
from multilabel_evaluation import MultilabelPredictionEvaluater, MaskCreater

# Load the training labels and the corresponding predictions of our CC's:

# The correct labels. 
y_train = np.load('y_train.npy')

# Predictions by the CC trained in chain mode with debalancing [0, 1.4]
preds_train_AG_NT_chain = np.load('preds_train_AG_NT_chain.npy')

# Predictions by the CC trained with multi-class weighting (debalancing = 1.4).
preds_train_AG_NT_multi_class_weights = np.load('preds_train_AG_NT_multi_class_weights.npy')

# Predictions by the CC trained without sample weighting 
preds_train_AG_NT_no_sample_weights = np.load('preds_train_AG_NT_no_sample_weights.npy')

# Predictions by the CC where the first classifier was trained with no sample weights while the second used automatic sample 
# weighting with debalancing set to 1.4 
preds_train_AG_NT_no_weights_automatic = np.load('preds_train_AG_NT_no_weights_automatic.npy')

preds_train_list = [preds_train_AG_NT_chain, preds_train_AG_NT_multi_class_weights, preds_train_AG_NT_no_sample_weights]
preds_train_list += [preds_train_AG_NT_no_weights_automatic]

# We now write some auxiliary functions that we use to extract subsets of our training sets 
def produce_masks_AGNT(y):
    """
    Function that creates masks allowing for extraction of interesting subsets of our AGNT training and test sets.

    More precisely masks are created that extract each of the following subsets respectively:
    samples labeled 'Algebraic geometry', samples labeled 'Number theory', samples labeled 'Algebraic geometry and not
    Number theory', samples labeled 'Number theory but not Algebraic geometry', samples labeled 'Algebraic geometry and
    Number theory' and finally all samples.

    The function returns a dictionary where the keys are descriptions of the subsets the corresponding masks extract.


    :param y: Multi-labels

    :return: Dictionary between subset descriptions and masks

    :type y: ndarray, shape = (num_samples, num_labels)

    :rtype: dict
    """
    mask_creater = MaskCreater(y)
    masks = []
    mAG = mask_creater(col_ones=[0])
    masks.append(mAG)
    mNT = mask_creater(col_ones=[1])
    masks.append(mNT)
    mAGnotNT = mask_creater(col_ones=[0], col_zeros=[1])
    masks.append(mAGnotNT)
    mNTnotAG = mask_creater(col_ones=[1], col_zeros=[0])
    masks.append(mNTnotAG)
    mAGandNT = mask_creater(col_ones=[0, 1])
    masks.append(mAGandNT)
    mAgorNT = mask_creater()
    masks.append(mAgorNT)

    names = ['Algebraic geometry', 'Number Theory', 'Algebraic geometry but not Number theory']
    names += ['Number theory but not Algebraic geometry', 'Algebraic geometry and Number theory', 'No mask']

    names_mask_dict = {key: value for (key, value) in zip(names, masks)}

    return names_mask_dict



def apply_mask_preds(mask, preds):
    """
    Function for extracting subsets of predictions.

    :param mask: Mask that extracts the subsets
    :param preds: List of predictions
    :return: List of extracted predictions

    :type mask: ndarray, entries must be of type bool
    :type preds: list, entries must be ndarrays with shape[0] = mask.shape[0]
    :rtype: list
    """
    return [pred[mask] for pred in preds]

names_mask_train = produce_masks_AGNT(y_train)

# The labels of our classes are AG for algebraic geometry and NT for number theory 
labels = ['AG', 'NT']

print('We compare the predictions on various subsets of the training set')
for key, value in names_mask_train.items():
    print(key)
    evaluater = MultilabelPredictionEvaluater(y_train[value])
    tbl = evaluater.comparison_table(predictions=apply_mask_preds(value, preds_train_list), labels=labels)
    tbl = tbl.rename({0: 'chain', 1:'multi-class weights' , 2: 'no weights', 3: 'no weight + automatic'},axis='index')
    tbl = tbl.sort_values(by=['accuracy'], ascending=False)
    print(tbl)
    print()
    print()

This program gives the following output:

We compare the predictions on various subsets of the training set
Algebraic geometry
                       strict accuracy  accuracy  false positives   false negatives most false positives most false negatives
no weights                    0.997490  0.998296                 5               51                   NT                   AG
multi-class weights           0.990704  0.993586               133               81                   NT                   AG
no weight + automatic         0.988844  0.992563               158               82                   NT                   NT
chain                         0.980524  0.987016               360               59                   NT                   NT


Number Theory
                       strict accuracy  accuracy  false positives   false negatives most false positives most false negatives
no weights                    0.996424  0.997616                17               45                   AG                   NT
chain                         0.995097  0.996731                27               58                   AG                   NT
no weight + automatic         0.994636  0.996405                17               77                   AG                   NT
multi-class weights           0.988983  0.992559               177               19                   AG                   AG


Algebraic geometry but not Number theory
                       strict accuracy  accuracy  false positives   false negatives most false positives most false negatives
no weights                    0.999526  0.999649                 5                6                   NT                   AG
no weight + automatic         0.991363  0.994242               158                6                   NT                   AG
multi-class weights           0.990152  0.993189               133               68                   NT                   AG
chain                         0.980884  0.987256               360                3                   NT                   AG


Number theory but not Algebraic geometry
                       strict accuracy  accuracy  false positives   false negatives most false positives most false negatives
no weights                    0.998852  0.999235                17                0                   AG                   AG
no weight + automatic         0.998852  0.999212                17                1                   AG                   NT
chain                         0.998042  0.998695                27                2                   AG                   NT
multi-class weights           0.987983  0.991876               177                6                   AG                   NT


Algebraic geometry and Number theory
                       strict accuracy  accuracy  false positives   false negatives most false positives most false negatives
multi-class weights           0.994851  0.996568                 0               13                   AG                   AG
no weights                    0.982178  0.988119                 0               45                   AG                   NT
chain                         0.977822  0.985215                 0               56                   AG                   NT
no weight + automatic         0.969901  0.979934                 0               76                   AG                   NT


No mask
                       strict accuracy  accuracy  false positives   false negatives most false positives most false negatives
no weights                    0.998045  0.998679                22               51                   AG                   AG
no weight + automatic         0.992925  0.995274               175               83                   NT                   NT
multi-class weights           0.989594  0.992888               310               87                   AG                   AG
chain                         0.987667  0.991778               387               61                   NT                   NT

On the training set all the classifiers do really well, with a (non-strict) accuracy above 0.97 on every subset! Note that the subset of samples labeled both as algebraic geometry and number theory is the only subset where the CC trained without weighted samples doesn’t perform the best.

Let us now change our script so we see how the predictions for the test set compare.


# Load the test labels and the corresponding predictions of our CC's

# The correct labels 
y_test = np.load('y_test.npy')

# Predictions by the CC trained in chain mode with debalancing [0, 1.4]
preds_test_AG_NT_chain = np.load('preds_test_AG_NT_chain.npy')

# Predictions by the CC trained with multi-class weighting (debalancing = 1.4).
preds_test_AG_NT_multi_class_weights = np.load('preds_test_AG_NT_multi_class_weights.npy')

# Predictions by the CC trained without sample weighting 
preds_test_AG_NT_no_sample_weights = np.load('preds_test_AG_NT_no_sample_weights.npy')

# Predictions by the CC where the first classifier was trained with no sample weights while the second used automatic sample 
# weighting with debalancing set to 1.4 
preds_test_AG_NT_no_weights_automatic = np.load('preds_test_AG_NT_no_weights_automatic.npy')

preds_test_list = [preds_test_AG_NT_chain, preds_test_AG_NT_multi_class_weights, preds_test_AG_NT_no_sample_weights]
preds_test_list += [preds_test_AG_NT_no_weights_automatic]

names_mask_test = produce_masks_AGNT(y_test)

print('We compare the predictions on subsets of the test set' + '\n')
for key, value in names_mask_test.items():
    print(key)
    evaluater = MultilabelPredictionEvaluater(y_test[value])
    tbl = evaluater.comparison_table(predictions=apply_mask_preds(value, preds_test_list), labels=labels)
    tbl = tbl.rename({0: 'chain', 1:'multi-class weights' , 2: 'no weights', 3: 'no weights + automatic'},axis='index')
    tbl = tbl.sort_values(by=['accuracy'], ascending=False)

    print(tbl)
    print()
    print()

which gives the output:

We compare the predictions on subsets of the test set

Algebraic geometry
                        strict accuracy  accuracy  false positives   false negatives most false positives most false negatives
no weights                     0.833112  0.867273               441              780                   NT                   AG
chain                          0.824777  0.866957               540              642                   NT                   AG
multi-class weights            0.826672  0.864431               567              665                   NT                   AG
no weights + automatic         0.820799  0.859064               516              770                   NT                   AG


Number Theory
                        strict accuracy  accuracy  false positives   false negatives most false positives most false negatives
multi-class weights            0.770136  0.815988               666              758                   AG                   NT
no weights + automatic         0.772398  0.815460               576              865                   AG                   NT
no weights                     0.770136  0.813198               576              885                   AG                   NT
chain                          0.766063  0.809276               647              848                   AG                   NT


Algebraic geometry but not Number theory
                        strict accuracy  accuracy  false positives   false negatives most false positives most false negatives
no weights                     0.905405  0.912627               441              340                   NT                   AG
chain                          0.884170  0.904404               540              257                   NT                   AG
no weights + automatic         0.889318  0.901902               516              340                   NT                   AG
multi-class weights            0.878164  0.896182               567              317                   NT                   AG


Number theory but not Algebraic geometry
                        strict accuracy  accuracy  false positives   false negatives most false positives most false negatives
no weights + automatic         0.848541  0.860899               576              435                   AG                   NT
no weights                     0.848541  0.860023               576              445                   AG                   NT
multi-class weights            0.824086  0.847051               666              410                   AG                   NT
chain                          0.829345  0.845823               647              463                   AG                   NT


Algebraic geometry and Number theory
                        strict accuracy  accuracy  false positives   false negatives most false positives most false negatives
multi-class weights            0.437601  0.624527                 0              348                   AG                   NT
chain                          0.376013  0.584009                 0              385                   AG                   NT
no weights + automatic         0.303079  0.535386                 0              430                   AG                   NT
no weights                     0.286872  0.524581                 0              440                   AG                   NT


No mask
                        strict accuracy  accuracy  false positives   false negatives most false positives most false negatives
no weights                     0.839573  0.864237              1017             1225                   AG                   NT
no weights + automatic         0.832416  0.859833              1092             1205                   AG                   NT
chain                          0.826690  0.858108              1187             1105                   AG                   NT
multi-class weights            0.825589  0.857153              1233             1075                   AG                   NT

We definitely see the result of overfitting, but our classifiers still make reasonably good predictions. Again the classifier chain trained without sample weighting tends to do best, ranking first on three of the six subsets, but it ranks last on the samples labelled with both labels. On this latter subset the CC trained with multi-class weighting (with debalancing set to 1.4) performs best, which was also the case on the training samples. Notice further that the overfitting is most severe on this subset, which consists of the samples in the minority class: the accuracy drops from about 0.98 or higher on the training set to between about 0.52 and 0.62 on the test set. This can probably be partially explained by inconsistent labelling by arXiv users, but can we get better results if we set the debalancing parameter to a smaller value?
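
The size of the drop on the minority class can be read off directly from the two tables. As a rough illustration, the following snippet copies the accuracy values reported above for the subset labelled both algebraic geometry and number theory and prints the gap between training and test accuracy for each classifier. The dictionary names are illustrative, and the key 'no weights + automatic' is used for both sets even though the training output spells it 'no weight + automatic'.

```python
# Accuracy on the samples labelled both AG and NT, copied from the tables above.
train_acc = {
    'multi-class weights': 0.996568,
    'no weights': 0.988119,
    'chain': 0.985215,
    'no weights + automatic': 0.979934,
}
test_acc = {
    'multi-class weights': 0.624527,
    'chain': 0.584009,
    'no weights + automatic': 0.535386,
    'no weights': 0.524581,
}

# Training accuracy minus test accuracy: a larger gap indicates stronger overfitting.
gaps = {name: train_acc[name] - test_acc[name] for name in train_acc}

# Print the classifiers from the smallest gap to the largest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f'{name:<24} {gap:.3f}')
```

On these numbers the CC trained with multi-class weighting has the smallest gap and the CC trained without sample weighting the largest, which matches the ranking on the test set.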

That was all for this post. The repository for the project we just explained can be found here