
Crowdsourcing Performances in Energy Datasets Labeling (B)

Status: Completed


Supervised and semi-supervised learning algorithms often require labeled data as ground truth to test and validate learning models. In some domains, such as computer vision, this often consists of labeling images or identifying and locating objects in images. Similarly, opinion mining requires extracting the sentiment expressed in given sentences. Such microtasks are easily distributed to multiple users on crowdsourcing platforms (such as CrowdFlower or Amazon Mechanical Turk), as any user can carry them out. Is this also applicable to labeling energy data (i.e. power traces)?

Crowdsourcing consists in trusting the wisdom of the crowd to execute microtasks. However, can the quality of the work done by users on crowd platforms be trusted, when malicious or careless users might be tempted to execute their tasks poorly and still be monetarily rewarded for unreliable work? While strategies exist for inserting test questions to identify unreliable workers, in some cases the quality of the results also depends on the definition of the task handed out to the workers (it can be ambiguous) or on its difficulty.
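The test-question strategy mentioned above can be illustrated with a minimal Python sketch. It is not taken from the project or from any crowdsourcing platform's API: the tuple layout, the function names and the 0.7 accuracy threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict


def worker_accuracy(answers, gold):
    """Return each worker's fraction of correct answers on test questions.

    answers: iterable of (worker_id, question_id, label) tuples (assumed layout).
    gold: dict mapping each test question_id to its known correct label.
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for worker, question, label in answers:
        if question in gold:  # only test questions are scored
            seen[worker] += 1
            correct[worker] += int(label == gold[question])
    return {w: correct[w] / seen[w] for w in seen}


def reliable_workers(answers, gold, min_accuracy=0.7):
    """Workers whose test-question accuracy reaches min_accuracy (assumed threshold)."""
    scores = worker_accuracy(answers, gold)
    return {w for w, acc in scores.items() if acc >= min_accuracy}


# Hypothetical answers: q1 and q2 are test questions, q3 is a real task.
answers = [
    ("w1", "q1", "fridge"), ("w1", "q2", "kettle"), ("w1", "q3", "oven"),
    ("w2", "q1", "kettle"), ("w2", "q2", "kettle"), ("w2", "q3", "tv"),
]
gold = {"q1": "fridge", "q2": "kettle"}
print(reliable_workers(answers, gold))
```

Such a filter only catches workers who fail the test questions; it does not address ambiguous task definitions or task difficulty, which affect reliable workers as well.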

Obtaining energy datasets is a tedious and costly effort, as houses need to be fully instrumented. Often, the collected data consist of power measurements only, while higher-level metadata, such as who turned on a given appliance and when, or what type of activity (cooking, etc.) was taking place, are not available. Instead of collecting data from scratch, one could use existing datasets and develop learning algorithms to extract these metadata. However, this also means that ground truth data must be acquired. Labeling energy datasets brings additional challenges: the nature of the data (time series) and the knowledge needed to carry out the task accurately and successfully.


The goal of this thesis is to evaluate the performance of regular workers, who could typically be hired on a platform such as Amazon Mechanical Turk, against the work of expert users (with knowledge about the energy profiles of appliances) in the task of labeling power time series through our CAFED platform.
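One simple way to picture such a comparison is to aggregate the crowd labels for each segment of a power trace by majority vote and measure how often the result matches the expert label. The Python sketch below is only an illustration under assumed data: it does not reflect the CAFED data model or the evaluation method actually used in the thesis, and the segment identifiers and appliance labels are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Most frequent label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]


def agreement_with_experts(crowd, expert):
    """Fraction of expert-labeled segments where the crowd majority agrees.

    crowd: dict mapping segment_id to the list of labels given by workers.
    expert: dict mapping segment_id to the single expert label.
    """
    shared = [seg for seg in expert if seg in crowd]
    if not shared:
        return 0.0
    hits = sum(majority_vote(crowd[seg]) == expert[seg] for seg in shared)
    return hits / len(shared)


# Hypothetical labels for four segments of a power trace.
crowd = {
    "seg1": ["kettle", "kettle", "oven"],
    "seg2": ["fridge", "tv", "tv"],
    "seg3": ["oven", "oven", "oven"],
    "seg4": ["tv", "fridge", "fridge"],
}
expert = {"seg1": "kettle", "seg2": "fridge", "seg3": "oven", "seg4": "fridge"}
print(agreement_with_experts(crowd, expert))
```

Plain agreement treats the expert label as ground truth and ignores how closely the workers agree with each other; a fuller evaluation would likely need additional measures.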


This project requires an interest in data mining and database systems. The student is expected to be familiar with relational database architectures (preferably PostgreSQL), Unix systems and web development (Apache, PHP, JavaScript). The student should be interested in working with new tools and follow good coding practices (coding style and documentation of code). Students are expected to be highly motivated to work on their topics and to meet with their supervisor regularly to discuss current progress and next steps.

Student: Felix Rauchenstein
Contact: Hông-Ân Cao

ETH Zurich, Distributed Systems Group
Last updated May 22 2017 02:02:37 PM MET hac