I am Michael Firman, and I work at UCL in the Vision and Graphics group. I am currently working as a postdoc on the Engage project, making machine learning tools accessible to scientists across different disciplines.
My PhD was supervised by Dr. Simon Julier and Dr. Jan Boehm. During my PhD I predominantly worked on the problem of inferring a full volumetric reconstruction of a scene, given only a single depth image as input. This problem has many applications in robotics, computer graphics and entertainment devices.
During the summer of 2012 I worked at the National Institute of Informatics, Tokyo, under the supervision of Prof. Akihiro Sugimoto.
There are many great lists of computer vision datasets on the web but no dedicated source for datasets captured by Kinect or similar devices. I created this list in an attempt to remedy the situation.
Computer Vision and Pattern Recognition (CVPR) 2018
Many structured prediction tasks in machine vision have a collection of acceptable answers, instead of one definitive ground truth answer. Segmentation of images, for example, is subject to human labeling bias. Similarly, there are multiple possible pixel values that could plausibly complete occluded image regions. State-of-the-art supervised learning methods are typically optimized to make a single test-time prediction for each query, failing to find other modes in the output space. Existing methods that allow for sampling often sacrifice speed or accuracy.
We introduce a simple method for training a neural network, which enables diverse structured predictions to be made for each test-time query. For a single input, we learn to predict a range of possible answers. We compare favorably to methods that seek diversity through an ensemble of networks. Such stochastic multiple choice learning faces mode collapse, where one or more ensemble members fail to receive any training signal. Our best performing solution can be deployed for various tasks, and just involves small modifications to the existing single-mode architecture, loss function, and training regime. We demonstrate that our method results in quantitative improvements across three challenging tasks: 2D image completion, 3D volume estimation, and flow prediction.
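The mode collapse mentioned above can be illustrated with a minimal sketch of a winner-takes-all loss, the kind of objective used when an ensemble is trained for diversity. This is not the method from the paper; the function name `winner_takes_all`, the squared-error loss and the toy numbers are all assumptions made for illustration.

```python
def winner_takes_all(predictions, targets):
    """Oracle (winner-takes-all) loss for an ensemble.

    predictions[m][n] is the scalar prediction of ensemble member m
    for example n; targets[n] is the ground truth for example n.
    Returns the mean loss of the best member per example, and the
    index of the winning member for each example.
    """
    winners = []
    total = 0.0
    for n, target in enumerate(targets):
        # Squared error of every member on this example.
        errors = [(member[n] - target) ** 2 for member in predictions]
        # Only the member with the lowest error is charged for this
        # example, so only that member would receive a training signal.
        best = min(range(len(errors)), key=errors.__getitem__)
        winners.append(best)
        total += errors[best]
    return total / len(targets), winners


# Toy data (made up): two modes near 0 and 1, and three members.
targets = [0.0, 1.0, 0.1, 0.9]
predictions = [
    [0.0, 0.0, 0.0, 0.0],  # member 0 sits on the first mode
    [1.0, 1.0, 1.0, 1.0],  # member 1 sits on the second mode
    [5.0, 5.0, 5.0, 5.0],  # member 2 starts far from the data
]
loss, winners = winner_takes_all(predictions, targets)

# How many examples each member "wins".
wins_per_member = [winners.count(m) for m in range(len(predictions))]
```

In this toy setting member 2 never wins an example, so under a winner-takes-all update it would never be trained, which is the failure case the abstract describes.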
Cities support unique and valuable ecological communities, but understanding urban wildlife is limited due to the difficulties of assessing biodiversity. Ecoacoustic surveying is a useful way of assessing habitats, where biotic sound measured from audio recordings is used as a proxy for biodiversity. However, existing algorithms for measuring biotic sound have been shown to be biased by non-biotic sounds in recordings, typical of urban environments. We develop CityNet, a deep learning system using convolutional neural networks (CNNs), to measure audible biotic (CityBioNet) and anthropogenic (CityAnthroNet) acoustic activity in cities. The CNNs were trained on a large dataset of annotated audio recordings collected across Greater London, UK.
We found that our deep learned model outperformed existing measures of both biotic and anthropogenic sound. Predictions from our trained model can be visualised on this website, showing the distribution of sounds across London, and over different times of day.
PLoS Computational Biology 2018
There is a critical need for robust and accurate tools to scale up biodiversity monitoring and to manage the impact of anthropogenic change. For example, the monitoring of bat species and their population dynamics can act as an important indicator of ecosystem health as they are particularly sensitive to habitat conversion and climate change. In this work we propose a fully automatic and efficient method for detecting bat echolocation calls in noisy audio recordings. We show that our approach is more accurate than existing algorithms and other commercial tools. Our method enables us to automatically estimate bat activity from multi-year, large-scale, audio monitoring programmes.
CHI 2018: Late Breaking Work
On the surface, task-completion should be easy in graphical user interface (GUI) settings. In practice, however, different actions look alike and applications run in operating-system silos. Our aim within GUI action recognition and prediction is to help the user, at least in completing the tedious tasks that are largely repetitive. We propose a method that learns from a few user-performed demonstrations, and then predicts and finally performs the remaining actions in the task. For example, a user can send customized SMS messages to the first three contacts in a school's spreadsheet of parents; then our system loops the process, iterating through the remaining parents.
CVPR Workshop on Large Scale 3D Data: Acquisition, Modelling and Analysis 2016
Since the launch of the Microsoft Kinect, scores of RGBD datasets have been released. These have propelled advances in areas from reconstruction to gesture recognition. In this paper we explore the field, reviewing datasets across eight categories: semantics, object pose estimation, camera tracking, scene reconstruction, object tracking, human actions, faces and identification. By extracting relevant information in each category we help researchers to find appropriate data for their needs, and we consider which datasets have succeeded in driving computer vision forward and why.
Finally, we examine the future of RGBD datasets. We identify key areas which are currently underexplored, and suggest that future directions may include synthetic data and dense reconstructions of static and dynamic scenes.
Computer Vision and Pattern Recognition (CVPR) 2016 (Oral)
Building a complete 3D model of a scene, given only a single depth image, is underconstrained. To gain a full volumetric model, one needs either multiple views, or a single view together with a library of unambiguous 3D models that will fit the shape of each individual object in the scene.
We hypothesize that objects of dissimilar semantic classes often share similar 3D shape components, enabling a limited dataset to model the shape of a wide range of objects, and hence estimate their hidden geometry. Exploring this hypothesis, we propose an algorithm that can complete the unobserved geometry of tabletop-sized objects, based on a supervised model trained on already available volumetric elements. Our model maps from a local observation in a single depth image to an estimate of the surface shape in the surrounding neighborhood. We validate our approach both qualitatively and quantitatively on a range of indoor object collections and challenging real scenes.
International Conference on Intelligent Robots and Systems (IROS) 2013
We introduce a method to discover objects from RGB-D image collections which does not require a user to specify the number of objects expected to be found. We propose a probabilistic formulation to find pairwise similarity between image segments, using a classifier trained on labelled pairs from the recently released RGB-D Object Dataset. We then use a correlation clustering solver to both find the optimal clustering of all the segments in the collection and to recover the number of clusters. Unlike traditional supervised learning methods, our training data need not be of the same class or category as the objects we expect to discover. We show that this parameter-free supervised clustering method has superior performance to traditional clustering methods.
Further information: This work was begun during an internship at NII, Tokyo in the summer of 2012, and was partially supported by an NII internship grant.
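The clustering step in the abstract above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the solver used in the paper: pairwise "same object" probabilities are converted to log-odds, and every partition of a handful of segments is scored exhaustively, which is only feasible for very small inputs. The names `partitions`, `score` and `best_clustering`, and the probabilities themselves, are hypothetical.

```python
import math
from itertools import combinations


def partitions(items):
    """Yield every partition of a list as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` into each existing block in turn...
        for i, block in enumerate(smaller):
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        # ...or give it a block of its own.
        yield [[first]] + smaller


def score(partition, prob_same):
    """Sum of log-odds over all pairs placed in the same cluster.

    Assumes every probability is strictly between 0 and 1.
    """
    total = 0.0
    for block in partition:
        for a, b in combinations(sorted(block), 2):
            p = prob_same[(a, b)]
            total += math.log(p / (1.0 - p))
    return total


def best_clustering(n, prob_same):
    """Exhaustively find the highest-scoring partition of n segments."""
    best = max(partitions(list(range(n))),
               key=lambda part: score(part, prob_same))
    return sorted(sorted(block) for block in best)


# Made-up classifier outputs: probability that two segments
# show the same object, keyed by (smaller index, larger index).
prob_same = {
    (0, 1): 0.9, (0, 2): 0.1, (0, 3): 0.2,
    (1, 2): 0.1, (1, 3): 0.1, (2, 3): 0.8,
}
clusters = best_clustering(4, prob_same)
```

Pairs with probability above 0.5 contribute a positive score when grouped and pairs below 0.5 a negative one, so the number of clusters falls out of the optimisation instead of being supplied by the user.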
International Conference on Intelligent Robots and Systems (IROS) 2011
Recent work in the domain of classification of point clouds has shown that topic models can be suitable tools for inferring class groupings in an unsupervised manner. However, point clouds are frequently subject to non-negligible amounts of sensor noise. In this paper, we analyze the effect on classification accuracy of noise added to both an artificial data set and data collected from a Light Detection and Ranging (LiDAR) scanner, and show that topic models are less robust to 'misspelled' words than the more naive k‑means classifier. Furthermore, standard spin images prove to be a more robust feature under noise than their derivative, 'angular' spin images. We additionally show that only a small subset of local features is required to give comparable classification accuracy to a full feature set.
A digital copy of Dr. Simon Prince's book 'Computer Vision: Models, Learning, and Inference', which forms a core of the syllabus, can be downloaded from www.computervisionmodels.com
I have an interest in the lyrical and musical content of music, and it seems sensible to try to use a computer to automate some of the process of discovering themes and trends in music.
As an experiment in Python, I wrote a program to analyse the lyrics of 33,000 songs from the US 40, from 1955 to the present day. I used Brian Langenberger's gdbm-based rhyming dictionary to automatically detect rhyme pairs in each song. This allowed the most popular pairs of rhyming words to be discovered, and changing trends in rhymes over time to be analysed.
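The counting step can be sketched as follows. This is a toy illustration and not the original program: the `RHYME_KEY` table is a hypothetical stand-in for a real rhyming dictionary, the lyrics are made up, and comparing the final words of adjacent lines is an assumption about how rhyme pairs might be detected.

```python
import re
from collections import Counter

# Hypothetical stand-in for a rhyming dictionary: each word maps to a
# key for its ending sound, and words sharing a key are taken to rhyme.
RHYME_KEY = {
    "fire": "IRE", "desire": "IRE",
    "night": "ITE", "light": "ITE",
    "love": "UHV", "above": "UHV",
}


def last_word(line):
    """Return the final word of a lyric line in lower case, or None."""
    words = re.findall(r"[a-z']+", line.lower())
    return words[-1] if words else None


def rhyme_pairs(lyrics):
    """Count rhyming pairs formed by the end words of adjacent lines."""
    ends = [last_word(line) for line in lyrics.splitlines()]
    ends = [word for word in ends if word]
    pairs = Counter()
    for a, b in zip(ends, ends[1:]):
        # Skip repeated words and words missing from the dictionary.
        if a != b and a in RHYME_KEY and RHYME_KEY.get(a) == RHYME_KEY.get(b):
            pairs[tuple(sorted((a, b)))] += 1
    return pairs


# Made-up lyrics for illustration only.
song = "\n".join([
    "You set my heart on fire",
    "You are my one desire",
    "I walk alone at night",
    "Until I see the light",
    "A burning fire",
    "My one desire",
])
counts = rhyme_pairs(song)
top_pair = counts.most_common(1)
```

Summing such counters over every song in a corpus, grouped by release year, would give both the most popular pairs overall and their trend over time.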
A wordcloud showing the most popular rhymes in the whole corpus can be downloaded from the sidebar to the left.
As far as I am aware, this is the first time that rhyme analysis has been performed in this way, and on such a large scale. I presented a preliminary version of this work at the Comparative Innovations Workshop at King's College London, in May 2013. When I have the time I will publish a version of this work here, with explanations of the methodology used and more results.