Defining a Task

The first two deliverables for your course project center on 'Defining your Task'. We will work together to ensure that you are proposing a challenge that machine learning can accomplish in the time we have, with the resources we have.

This is an open-ended project. Any domain is on the table.

Help! I don’t have any ideas.

You may want to check out this list of Awesome Public Datasets for inspiration.

Working from this list comes with a couple of challenges:

  • Just because a dataset is public doesn’t mean it is free ($$$), or ethical to use.
  • A dataset alone is not enough: e.g., MusixMatch has a lyrics dataset, but you still need a question to answer over it.

Continue reading for more general concerns.

Potential Task Concerns:

Open-ended machine learning projects are difficult primarily because of the data needed to do machine learning. However, subject to the constraints described below, I would like to support student projects interested in arbitrary real-world data.

The following questions will help you determine whether your dataset sets your project up for success.

Do I have access to the original data?

A lot of datasets come pre-processed into numbers to be crunched by machine learning algorithms. For example, Kaggle hosts machine learning competitions and has a lot of cool data, in theory. However, most Kaggle datasets are “just features” (just numbers) and not the original, raw data.

It is common, for instance, for music lovers to want to do something involving song lyrics or audio tracks, but these are difficult to acquire due to copyright restrictions. Alternative data may be simpler to acquire: Wikipedia has pages for famous songs, but no lyrics.

Is my task specific enough?

Wanting to design a chatbot is a goal that will involve machine learning. However, the problem is broad and poorly specified: what makes a good chatbot?

A more specific task would be one of message classification. Did the user type a question or a statement? In casual chats, the question mark may not be present.
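To make the question-versus-statement task concrete, here is a minimal rule-based sketch. It is only a hand-written baseline, not a learned model, and the function name and the list of question-starting words are illustrative assumptions rather than anything prescribed for the project.

```python
# Hypothetical baseline: guess whether a chat message is a question.
# The word list below is an assumption chosen for illustration only.
QUESTION_STARTERS = {
    "who", "what", "when", "where", "why", "how",
    "is", "are", "do", "does", "did",
    "can", "could", "will", "would", "should",
}


def looks_like_question(message: str) -> bool:
    """Return True if simple surface rules suggest the message is a question."""
    text = message.strip().lower()
    if not text:
        return False
    # Rule 1: an explicit question mark at the end.
    if text.endswith("?"):
        return True
    # Rule 2: the first word is a common question starter.
    first_word = text.split()[0].strip(",.!")
    return first_word in QUESTION_STARTERS


for msg in ["Where are you?", "where are you", "I am home", "you coming tonight"]:
    print(repr(msg), looks_like_question(msg))
```

A message like "you coming tonight" has neither a question mark nor a question word, so these rules miss it. Cases like that are where labeled examples and a learned classifier would be expected to help.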

Will my task be accomplishable with just a few hundred labels?

To build a machine learning system for your task, the system must learn from labeled data. Simple tasks may start giving good results with just a few hundred labels, but others may require thousands or millions. A machine cannot learn to do what a human cannot do in the first place, so it is wise to choose a task that is trivial for a human but complex for a machine.

Sometimes tasks are harder than you think, even for humans, e.g., Duck or Rabbit?

To ensure enough high-quality labels for some tasks, we will likely want to group students onto shared projects.
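One way to check whether a task is easy for humans is to have two people label the same items and measure how often they agree. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The function names and the example labels are hypothetical, and using kappa here is my suggestion rather than a course requirement.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two labelers gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    observed = percent_agreement(labels_a, labels_b)
    n = len(labels_a)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both labelers pick the same label
    # if each labels independently at their own observed rates.
    expected = sum((counts_a[lab] / n) * (counts_b[lab] / n) for lab in counts_a)
    if expected == 1.0:
        # Both labelers used one identical label for everything.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels from two hypothetical labelers on four images.
alice = ["duck", "duck", "rabbit", "rabbit"]
bob = ["duck", "rabbit", "rabbit", "rabbit"]
print(percent_agreement(alice, bob), cohen_kappa(alice, bob))
```

Low agreement between human labelers is a warning sign that the task definition is ambiguous, and that a model trained on those labels will inherit the ambiguity.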

Do I have the computational resources to manage this data?

Images and videos tend to be much larger (on disk) than text documents. Try to find a dataset that is smaller than roughly 5 to 10 GiB. Working with more data than this would require a lot of storage space and a lot of compute time.
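If you have already downloaded a candidate dataset, a quick size check can tell you whether it fits the budget above. This is a minimal sketch; the function names and the default budget of 5 GiB are assumptions for illustration.

```python
from pathlib import Path

GIB = 1024 ** 3  # bytes in one gibibyte


def dataset_size_bytes(root: str) -> int:
    """Total size in bytes of all regular files under the directory `root`."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def fits_budget(root: str, budget_gib: float = 5.0) -> bool:
    """True if the dataset under `root` is no larger than `budget_gib` GiB."""
    return dataset_size_bytes(root) <= budget_gib * GIB
```

For example, `fits_budget("data/", budget_gib=10.0)` would check a hypothetical `data/` folder against the upper end of the suggested range.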

Will you just give me some ideas?

My research is primarily in text domains, so all my suggestions will be in this direction. If you do not care about any specific real-world applications, I will happily go ahead and give you some of mine.

The main task I will likely suggest to you is deciding whether a specific page on Wikipedia is an X or not. Wikipedia is nice because you can dream up any question you might care about and there are likely to be some pages that fit.