Description

The task is to perform blocking for Entity Resolution, i.e., to quickly filter out non-matches (tuple pairs that are unlikely to represent the same real-world entity) within a limited time and generate a small candidate set containing a limited number of tuple pairs for matching.

Participants are asked to solve the task on two product datasets. Each dataset consists of a list of instances (rows) and a list of properties describing them (columns). We will refer to each of these datasets as Di.

For each dataset Di, participants will be provided with the following resources:

  • Xi : a subset of the instances in Di
  • Yi : matching pairs in Xi x Xi. (The pairs not in Yi are non-matching pairs.)
  • Blocking Requirements: the size of the generated candidate set (i.e., the number of tuple pairs in the candidate set)

Note that matching pairs in Yi are transitively closed (i.e., if A matches with B and B matches with C, then A matches with C). For a matching pair id1 and id2 with id1 < id2, Yi only includes (id1, id2) and doesn't include (id2, id1).

Your goal is to write a program that generates, for each Xi dataset, a candidate set of tuple pairs for matching Xi with Xi. The output must be stored in a CSV file containing the ids of the tuple pairs in the candidate set. The CSV file must have two columns, "left_instance_id" and "right_instance_id", and must be named "output.csv". The separator must be a comma. Note that we do not consider trivial equi-joins (tuple pairs with left_instance_id = right_instance_id) as true matches. For a pair id1 and id2 (assume id1 < id2), please include only (id1, id2), not (id2, id1), in your "output.csv".
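
For example, here is a minimal sketch of writing such a file with pandas (the pair list is a toy placeholder; only the column names and the file name follow the specification above):

    import pandas as pd

    def canonicalize(pairs):
        """Keep each pair once, ordered so left_instance_id < right_instance_id,
        and drop trivial self-pairs."""
        return sorted({(min(a, b), max(a, b)) for a, b in pairs if a != b})

    # 'pairs' would come from your blocking logic; toy ids are shown here.
    pairs = [(2, 1), (1, 2), (3, 3), (1, 4)]
    df = pd.DataFrame(canonicalize(pairs),
                      columns=["left_instance_id", "right_instance_id"])
    df.to_csv("output.csv", index=False)  # comma-separated, with header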

Solutions will be evaluated over the complete dataset Di. Note that the instances in Di (except the sample Xi) will not be provided to participants. More details are available in the Evaluation Process section.

Both Xi and Yi are in CSV format.

Example of dataset Xi

instance_id attr_name_1 attr_name_2 ... attr_name_k
00001 value_1 null ... value_k
00002 null value_2 ... value_k
... ... ... ... ...

Example of dataset Yi

left_instance_id right_instance_id
00001 00002
00001 00003
... ...

More details about the datasets can be found in the dedicated Datasets section.

Example of output.csv

left_instance_id right_instance_id
00001 00002
00001 00004
... ...

Output.csv format: The evaluation process expects "output.csv" to contain 3000000 tuple pairs. The first 1000000 tuple pairs are for dataset X1 and the remaining pairs are for dataset X2. Please format "output.csv" accordingly. You can check our provided baseline solution to see how to produce a valid "output.csv".
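
For instance, a hedged sketch of assembling the cumulative file follows; padding with self-pairs is only one illustrative way to reach the required row counts, not necessarily what the baseline does:

    import pandas as pd

    def pad_or_trim(df, target_size):
        """Trim or pad a candidate set to exactly target_size rows. Padding with
        self-pairs is only illustrative; they can never be true matches."""
        if len(df) >= target_size:
            return df.head(target_size)
        n_missing = target_size - len(df)
        filler = pd.DataFrame({"left_instance_id": [0] * n_missing,
                               "right_instance_id": [0] * n_missing})
        return pd.concat([df, filler], ignore_index=True) if len(df) else filler

    # cand1 / cand2 stand in for the candidate sets produced for X1 and X2.
    cand1 = pd.DataFrame(columns=["left_instance_id", "right_instance_id"])
    cand2 = pd.DataFrame(columns=["left_instance_id", "right_instance_id"])

    out = pd.concat([pad_or_trim(cand1, 1_000_000),   # first 1000000 rows: X1
                     pad_or_trim(cand2, 2_000_000)],  # remaining 2000000 rows: X2
                    ignore_index=True)
    out.to_csv("output.csv", index=False)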

Datasets

#  Name       Description                                             Number of rows  Blocking Requirements         Download Sample
1  Notebook   Notebook specifications                                 About 1000000   Candidate Set Size = 1000000  Dataset X1, Dataset Y1
2  Altosight  Product specifications (kindly provided by Altosight)   About 1000000   Candidate Set Size = 2000000  Dataset X2, Dataset Y2

Quick start:

Participants are encouraged to start with our provided baseline solution.
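
As an illustration only (this is not the provided baseline), a naive token-overlap blocking scheme over a single textual attribute could look like the sketch below; the "title" column name is hypothetical and should be replaced with attributes that actually exist in the dataset:

    import itertools
    from collections import defaultdict

    import pandas as pd

    def token_blocking(df, text_column="title", max_pairs=1_000_000):
        """Bucket instances by shared tokens in text_column, then emit candidate
        pairs from each bucket until max_pairs is reached (deliberately naive)."""
        buckets = defaultdict(list)
        for instance_id, text in zip(df["instance_id"], df[text_column].fillna("")):
            for token in str(text).lower().split():
                buckets[token].append(instance_id)

        candidates = set()
        for ids in buckets.values():
            for a, b in itertools.combinations(sorted(set(ids)), 2):
                candidates.add((a, b))
                if len(candidates) >= max_pairs:
                    return to_frame(candidates)
        return to_frame(candidates)

    def to_frame(candidates):
        return pd.DataFrame(sorted(candidates),
                            columns=["left_instance_id", "right_instance_id"])

    # Example: candidates = token_blocking(pd.read_csv("X1.csv"))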

Submission:

Each team can make at most 10 submissions each day. The number of remaining submissions will be reset at 11:59:59 PM Eastern Standard Time.

Participants are asked to use ReproZip to pack the solution they want to submit.

ReproZip is a tool for packing input files, libraries, and environment variables into a single bundle (in .rpz format) that can be reproduced on any machine.

A brief guide on how to use ReproZip to package your solution follows.

First of all, you have to install ReproZip on your machine. ReproZip can be installed via pip (pip install reprozip). More details about the installation can be found in the dedicated Documentation page.

Let’s suppose that your code is made up of a Python module called "blocking.py" and that you launch your program with the following command: python blocking.py.

First, ReproZip needs to trace the code execution. To do this, it is sufficient to run the following command: reprozip trace python blocking.py.

The code will be executed and a hidden folder (called ".reprozip-trace") will be created at the end of the process. This folder contains a "config.yml" file, a configuration file holding information about the input/output files, libraries, environment variables, etc. traced during the execution of your code. If you want to omit something that you think is not useful to pack, you can edit this file manually. Please be sure not to remove any libraries or files needed to reproduce the code; otherwise the bundle may not be reproducible.

Finally, to create the bundle, you have to run the following command: reprozip pack submission.rpz.

Please note that, if your code is made up of more than one file, even if the files are written in different programming languages, you can trace their execution by using the "--continue" option. You can find more details in the dedicated Documentation page.

At this point, a file called "submission.rpz" will be created and you can submit it using our dashboard.

Inside your code, it is important to refer to each input dataset Xi using its original name "Xi.csv".
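
For example (a minimal sketch, with pandas assumed), the inputs could be read as follows; at evaluation time the same file names will transparently point to the complete datasets:

    import pandas as pd

    # Refer to the inputs by their original names; during evaluation these
    # paths are replaced with the complete datasets D1.csv and D2.csv.
    x1 = pd.read_csv("X1.csv")
    x2 = pd.read_csv("X2.csv")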

Submitted solutions will be unpacked and reproduced using ReproUnzip on an evaluation server (Azure Standard F16s v2) with the following characteristics:

Processor: 16 CPU x 2.7 GHz
Main Memory: 32 GB
Storage: 32 GB
Operating System: Ubuntu 20.04.3 LTS

In particular, before running ReproUnzip, the Xi datasets (i.e., the original inputs you worked on) will be replaced with the complete datasets Di, which contain the hidden instances.

Here is the detailed sequence of operations used for the evaluation process:

  • reprounzip docker setup <bundle> <solution>, to unpack the uploaded bundle
  • reprounzip docker upload <solution> Di.csv:Xi.csv, to replace the input datasets (X1.csv, X2.csv, ...) with the complete ones (D1.csv, D2.csv, ...)
  • reprounzip docker run <solution>, to run the solution
  • reprounzip docker download <solution> output.csv, to retrieve the result (i.e., you must produce just one cumulative "output.csv" file for all the datasets)
  • evaluation of "output.csv"

Important Notes: Your solution will be evaluated on Di, but in order to be evaluated, your submission must meet the following requirements:

  • The program must be reproduced correctly (i.e., the process must end with the creation of the "output.csv" file without errors).
  • The program must finish within 35 min, otherwise it incurs a "timeout" error (i.e., the total time limit for blocking on the two datasets is 35 min).
  • The size of the candidate set (i.e., the number of rows in output.csv) must equal the size specified in the blocking requirements of Di; a quick local check is sketched below.
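
For instance, a minimal pre-submission sanity check (assuming the candidate-set sizes required above) could be:

    import pandas as pd

    out = pd.read_csv("output.csv")
    assert list(out.columns) == ["left_instance_id", "right_instance_id"]
    assert len(out) == 1_000_000 + 2_000_000  # X1 candidates followed by X2 candidates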

Evaluation Metrics: For each dataset Di, we will compute the resulting recall score as follows: $$Recall = \frac{\text{Number of true matches retained in the candidate set}}{\text{Total number of true matches in the ground truth}}$$ Note that trivial equi-joins (tuple pairs with left_instance_id = right_instance_id) are not considered true matches. Submitted solutions will be ranked by average Recall over all datasets. Ties will be broken by running time. Note that running time may vary slightly when the same solution is submitted multiple times. After the final submission deadline, we will run solutions tied on recall ten times to obtain averaged running times.
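
As a hedged sketch, the same recall score can be estimated locally on the provided samples (the ground-truth file name "Y1.csv" is an assumption; self-pairs are ignored, and pairs are assumed to already be in (smaller id, larger id) order):

    import pandas as pd

    def recall(candidates_csv, ground_truth_csv):
        """Fraction of ground-truth matches retained in the candidate set."""
        cand = pd.read_csv(candidates_csv)
        truth = pd.read_csv(ground_truth_csv)
        cand_pairs = {(a, b) for a, b in zip(cand["left_instance_id"],
                                             cand["right_instance_id"]) if a != b}
        true_pairs = {(a, b) for a, b in zip(truth["left_instance_id"],
                                             truth["right_instance_id"]) if a != b}
        return len(cand_pairs & true_pairs) / len(true_pairs)

    # Example (on the provided sample): recall("output.csv", "Y1.csv")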

Unfortunately, ReproUnzip sometimes prints useful information about errors to stdout, which is excluded from the submission log. If you are stuck on a technical error that prevents your submission from being reproduced, no useful information appears in the log, and even the provided ReproUnzip commands do not help you find the cause, you can send us an email and we will check the content of stdout for any useful information about the error.

Evaluation Process