Task Details
The task consists of identifying which instances, described by properties (i.e., attributes), represent the same real-world entity.
Participants are asked to solve the task on several datasets of different types (e.g., products, people, etc.) that will be released progressively. Each dataset consists of a list of instances (rows) and a list of properties describing them (columns); we will refer to each of these datasets as Di.
For each dataset Di, participants will be provided with the following resources:
- Xi : a subset of the instances in Di
- Yi : matching/non-matching labels for pairs in Xi x Xi
- Di metadata (e.g., how many instances it contains, what are the main characteristics)
Note that the Yi label sets are transitively closed (i.e., if A matches B and B matches C, then A matches C).
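Since the labels are transitively closed, it is usually worth closing your own predicted matches transitively as well before emitting them. Below is a minimal sketch of one way to do this with a union-find structure; it is not part of the contest materials, and the function name and pair format are purely illustrative:

```python
from itertools import combinations

def transitive_closure(pairs):
    """Return every pair implied by transitivity over the given matches."""
    parent = {}

    def find(x):
        # Iterative find with path halving.
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    # Group instance ids by their root, then emit all pairs per group.
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), []).append(x)
    return [p for members in groups.values()
            for p in combinations(sorted(members), 2)]

# (A,B) and (B,C) together imply (A,C).
print(transitive_closure([("A", "B"), ("B", "C")]))
```

Any equivalent grouping (e.g., connected components from a graph library) works just as well.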
Solutions will be evaluated over Zi = Di \ Xi. Note that the instances in Zi will not be provided to participants. More details are available in the Evaluation Process section.
Both Xi and Yi are in CSV format.
Example of dataset Xi
instance_id | attr_name_1 | attr_name_2 | ... | attr_name_k |
---|---|---|---|---|
00001 | value_1 | null | ... | value_k |
00002 | null | value_2 | ... | value_k |
... | ... | ... | ... | ... |
Example of dataset Yi
left_instance_id | right_instance_id | label |
---|---|---|
00001 | 00002 | 1 |
00001 | 00003 | 0 |
... | ... | ... |
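As a minimal sketch of how these files could be loaded (assuming you use pandas, which the contest does not require; the file names follow the convention above):

```python
import pandas as pd

# Read the instances; keep ids as strings to preserve leading zeros.
X = pd.read_csv("X2.csv", dtype={"instance_id": str})

# Read the labelled pairs: label == 1 means matching, 0 non-matching.
Y = pd.read_csv(
    "Y2.csv",
    dtype={"left_instance_id": str, "right_instance_id": str},
)

matches = Y[Y["label"] == 1]
print(f"{len(X)} instances, {len(matches)} labelled matching pairs")
```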
More details about the datasets can be found in the dedicated Datasets section.
Your goal is to find, for each Xi dataset, all pairs of instances that match (i.e., refer to the same real-world entity). The output must be stored in a CSV file containing only the matching instance pairs found by your system. The file must be named "output.csv", must use the comma as separator, and must have exactly two columns: "left_instance_id" and "right_instance_id".
Example of output.csv
left_instance_id | right_instance_id |
---|---|
00001 | 00002 |
00001 | 00004 |
... | ... |
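A minimal sketch of writing the output in the required format (assuming pandas, with a hypothetical list of id pairs standing in for your matcher's result):

```python
import pandas as pd

# Hypothetical result of your matcher: a list of (left_id, right_id) pairs.
matching_pairs = [("00001", "00002"), ("00001", "00004")]

out = pd.DataFrame(
    matching_pairs,
    columns=["left_instance_id", "right_instance_id"],
)
# Comma-separated, no index column, exactly the two required columns.
out.to_csv("output.csv", index=False)
```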
More details about the submission process can be found in the dedicated Submitting section.
# | Name | Description | Metadata | Download |
---|---|---|---|---|
1 | NotebookToy | Sample notebook specifications (will not be used for final leaderboard) | 128 instances, 16 attributes, 40 entities | Dataset X1, Dataset Y1 |
2 | Notebook | Notebook specifications | 538 instances, 14 attributes, 100 entities | Dataset X2, Dataset Y2 |
3 | NotebookLarge | Notebook specifications | 605 instances, 14 attributes, 158 entities | Dataset X3, Dataset Y3 |
4 | Altosight | Product specifications (kindly provided by Altosight) | 1356 instances, 5 attributes, 193 entities | Dataset X4, Dataset Y4 |
You can also download these datasets together with Snowman.
Snowman helps you compare and evaluate your data matching solutions. You can upload experiment results from your data matching solution and then easily compare them with a gold standard, compare two experiment runs with each other, or calculate binary metrics like precision and recall. Snowman is developed as part of a bachelor’s project at the Hasso Plattner Institute, Potsdam, in collaboration with SAP SE.
You can download the latest release, which already includes the datasets provided for the contest.
Participants are asked to use ReproZip to pack the solution they want to submit.
ReproZip is a tool for packing input files, libraries, and environment variables into a single bundle (in .rpz format) that can be reproduced on any machine.
A brief guide on how to use ReproZip to package your solution follows.
First of all, you have to install ReproZip on your machine. ReproZip can be installed via pip (pip install reprozip). More details about the installation can be found in the dedicated Documentation page.
Let’s suppose that your code is made up of a Python module called "greedy_matcher.py" and that you launch your program with the following command: python greedy_matcher.py.
ReproZip first needs to trace the code execution. To do so, it is sufficient to run the following command: reprozip trace python greedy_matcher.py.
The code will be executed and a hidden folder (called ".reprozip-trace") will be created at the end of the process. This folder contains a "config.yml" file, a configuration file with information about the input/output files, libraries, environment variables, etc. traced during the execution of your code. If you want to omit anything you think is not useful to pack, you can edit this file manually. Please be sure not to remove any libraries or files needed to reproduce the code; otherwise the bundle may become non-reproducible.
Finally, to create the bundle, you have to run the following command: reprozip pack submission.rpz.
Please note that if your code is made up of more than one file, even files written in different programming languages, you can trace the whole execution by using the "--continue" option (e.g., reprozip trace python prepare.py followed by reprozip trace --continue python greedy_matcher.py, where the first file name is just illustrative). You can find more details in the dedicated Documentation page.
At this point, a file called "submission.rpz" will be created and you can submit it using our dashboard.
Inside your code, it is important to refer to each input dataset Xi using its original name "Xi.csv".
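For example, a minimal top-level script (a sketch only; the matcher itself is left as a stub, and pandas is an assumption) could read each Xi.csv by its original name and write the single cumulative output.csv required by the evaluation process:

```python
import pandas as pd

def match(df):
    """Your matching logic: return a list of (left_id, right_id) pairs.
    Left as a stub here."""
    return []

all_pairs = []
# X1 is the toy dataset and is not used for the final leaderboard.
for name in ["X2.csv", "X3.csv", "X4.csv"]:
    df = pd.read_csv(name, dtype={"instance_id": str})
    all_pairs.extend(match(df))

# One cumulative output file covering all datasets, as required.
pd.DataFrame(
    all_pairs, columns=["left_instance_id", "right_instance_id"]
).to_csv("output.csv", index=False)
```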
Submitted solutions will be unpacked and reproduced using ReproUnzip on an evaluation server with the following characteristics:
Processor | 32 CPU x 2.1 GHz |
Main Memory | 64 GB |
Storage | 2 TB |
Operating System | Linux |
In particular, before running ReproUnzip, the Xi dataset (i.e., the original input you worked on) will be replaced with the Zi dataset, which contains the hidden instances.
Here is the detailed sequence of operations used for the evaluation process:
- reprounzip docker setup <bundle> <solution>, to unpack the uploaded bundle
- reprounzip docker upload <solution> Zj.csv:Xi.csv to replace the input datasets (X2.csv, X3.csv, ...) with the hidden ones (Z2.csv, Z3.csv, ...), potentially in shuffled order (e.g., X2 can be replaced by Z3)
- reprounzip docker run <solution>
- reprounzip docker download <solution> output.csv (i.e., you must produce just one file "output.csv" cumulative for all the datasets)
- evaluation of "output.csv"
Note that, in order to be evaluated, your submission must be reproduced correctly (i.e., the process must end with the creation of the "output.csv" file without errors) and must process all the datasets within a given total timeout (the whole cycle above must complete within the defined timeout). The timeout value is available below and will be updated every time a new dataset is released.
TIMEOUT: 25 min (last updated: 6 April 2021)
For each dataset Di we will compute the resulting F-measure with respect to Zi x Zi. Submitted solutions will be ranked on their average F-measure over all datasets.
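For reference, here is a minimal sketch of how a pairwise F-measure can be computed; the actual evaluation script is not public, so this only illustrates the metric:

```python
def f_measure(predicted, gold):
    """Pairwise F-measure; pairs are normalised so their order is irrelevant."""
    pred = {tuple(sorted(p)) for p in predicted}
    true = {tuple(sorted(p)) for p in gold}
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)

# One of two predicted pairs is correct, one of two gold pairs is found:
# precision = 0.5, recall = 0.5, F-measure = 0.5.
print(f_measure([("A", "B"), ("A", "C")], [("A", "B"), ("B", "D")]))
```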
Unfortunately, ReproUnzip sometimes prints useful information about errors on stdout, which is excluded from the submission log. If you are stuck on a technical error that prevents the successful reproduction of your submission, no useful information appears in the log, and not even the provided ReproUnzip commands help you find the cause, you can send us an email so that we can check the content of stdout for any useful information about the error.
Rules
- ACM SIGMOD 2021 Programming Contest is open to undergraduate and graduate students from degree-granting institutions all over the world. However, students associated with the organizers' institutions are not eligible to participate.
- Teams must consist of individuals currently registered as graduate or undergraduate students at an accredited academic institution. A team may be formed by one or more students, who need not be enrolled at the same institution. Several teams from the same institution can compete independently, but one person can be a member of only one team. There is no limit on team size. Teams can register on the contest site after 25 February 2021.
- All submissions must consist only of code written by the team or open source licensed software (i.e., using an OSI-approved license). For source code from books or public articles, clear reference and attribution must be made. Final submissions must be made by 30 April 2021 (anywhere on Earth).
- All teams must agree to license their code under an OSI-approved open source license. By participating in this contest, each team agrees to publish its source code. The finalists' implementations will be made public on the contest website.