  1. Using Apache Spark Neural Networks to Recognise Digits

    One of the famous machine learning challenges is performing handwritten character recognition (classification) over the MNIST database of handwritten digits. The MNIST dataset has a training set of 60,000 and a test set of 10,000 28x28 pixel images of handwritten digits and an integer value between 0…

    on tutorial spark sparkml scala

  2. AffineTransform Transformer for Apache Spark ML

    Whilst playing with the MNIST dataset I found I needed a way of rotating images and so I decided to build an Affine Transform Transformer for Apache Spark ML. I have implemented the basic Affine Transformation operations: rotate, scaleX, scaleY, shearX, shearY, translateX, translateY. Any pixel which exceeds the image…

    on spark sparkml scala

  3. A Date Hierarchy for Neo4j

    I wrote this a while ago based on this excellent post and added a few more attributes. Given that Neo4j doesn't have a datatype to deal with dates it might come in handy for you too. It will generate a calendar between the years specified at the top of the…

    on neo4j cypher

  4. Setting up a Standalone Apache Spark Cluster

    A few people have asked me how to set up a small Standalone Spark Cluster for testing. Here are the scripts for Ubuntu 15.10 to install Apache Spark 1.6.0, which should have you up and running very quickly. This guide assumes you have done a new installation of…

    on tutorial spark

  5. A better Binarizer for Apache Spark ML

    Update: This code has been approved and should appear in Apache Spark 2.0.0. The Binarizer transformer (API) is part of the core Apache Spark ML package. Its job is simple: compare a series of numbers against a threshold value and if the value is greater than the threshold…

    on spark sparkml scala

  6. Porter Stemming in Apache Spark ML

    As I have been playing with Apache Spark ML and needed a stemming algorithm, I decided to have a go and write a custom transformer myself. As of Spark 1.5.2 stemming has not been introduced (should be in 1.7.0) but I have taken the Porter Stemmer…

    on spark scala sparkml

  7. Natural Language Processing with Apache Spark ML and Amazon Reviews (Part 2)

    Continues from Part 1. 4 Execution. 4.1 The Pipeline. Now that we have all the components of the pipeline ready, all that is needed is to load them into the Spark ML Pipeline(). A pipeline helps with the sequencing of stages so that we can automate the pipeline in the…

    on tutorial spark sparkml scala

  8. Natural Language Processing with Apache Spark ML and Amazon Reviews (Part 1)

    The most exciting feature of Apache Spark is its 'generality', meaning the ability to rapidly take some text data, transform it to a graph structure and perform some network analysis with GraphX, take that dataset and apply some machine learning algorithms with SparkML, and store it in memory and query…

    on tutorial spark sparkml scala

  9. Performance Tuning Spark WikiPedia PageRank

    In my previous post I wrote some code to demonstrate how to go from the raw database extracts provided monthly by WikiPedia through to loading into Apache Spark GraphX and running PageRank. In this post I will discuss my efforts to make that process more efficient which may be relevant…

    on tutorial spark scala graphx

  10. Computing WikiPedia's internal PageRank with Apache Spark

    Recently I have spent a lot of time reading and learning about graphs and graph analytics, which naturally drew me to Apache Spark GraphX, having previously played with Neo4j. The benefits of GraphX are: fully open source; scalable using the Apache Spark model; written in Scala, which I have been…

    on tutorial spark scala graphx