COMS 4232: Advanced Algorithms
Project Description Handout
Mar 15, 2025
1 Project Information
The goal of the project is for you to delve into a particular topic in more detail, together with a team. The final projects can be of two types.
Theory-based: Read a few recent research papers on a coherent topic and investigate a concrete related question, leading to some original contribution. The project will describe what the main problem is, why it is important, and what the main ideas are (both in terms of algorithms and analysis).
The original contribution can involve:
• developing an algorithm and/or analysis for a problem that was studied earlier, or for a variant of it that was not.
• studying an algorithm (existing or new) for a particular case (e.g., with extra assumptions, random input, a concrete dataset).
• generalizing an existing algorithm to handle somewhat (or much) more general inputs than those strictly covered by the paper's proofs.
• showing that a (natural) algorithm does not work on some counter-examples you design.
• designing a new interesting question, even if you don’t solve it (or give only some preliminary easy observations).
The original contribution may also lie within more applied sciences: e.g., perhaps in your area, certain theoretical algorithms can be modified to achieve even better performance due to special properties of the datasets.
Implementation-based: Implement some of the algorithms from the class (or from other theoretical literature), and perhaps apply them to your area of interest/expertise, using real-world datasets. One required part of such a project is some kind of original contribution, in the form of a new(ish) message/conclusion: e.g., a comparison among a few algorithms/models (“this is better than that, at least for these datasets”), or a non-trivial adaptation of an algorithm to an application area (“this type of algorithm can be modified/simplified as follows to really shine for this kind of problem”). Writing guidelines are as above; in addition, you will have to describe the setup of the experiments (computers, datasets used, which algorithms are compared).
Teams: teams of 2-4 people are allowed. Single-person teams require special permission from the instructor.
Topic: the topic of your project must be within the scope of Theoretical Computer Science, and preferably algorithmic. In particular, the focus is on algorithms with provable guarantees (for the implementation type, you may compare such theoretical guarantees with heuristics, or simply use ideas inspired by a theoretical algorithm). See details and examples in the next section. Feel free to schedule meeting times with the staff to discuss ideas.
Deliverables: there will be a project proposal and a final report, as described below. The level of the write-up should be as if you are presenting a lecture to your classmates. In fact, you are strongly encouraged to share your document with classmates (non-teammates) so that they can give you comments on your write-up.
• Project proposal: expected length is 1-2 pages. The proposal needs to explain precisely the focus of your project, including: 0) your team, 1) a description of the problem you are focusing on, 2) possible references (I might suggest adding or pivoting to others), and, if necessary, 3) your goal. Be as precise/mathematical as possible. At this stage you are expected to have read the introductions of the relevant papers (but not the technical details). It is ok (but not encouraged) to suggest more than one possible direction.
• Final report: expected length is about 8 pages, excluding references (11pt font, 1-inch margins). You can include details beyond page 8 as an appendix.
Below are more details and suggestions on project topics.
2 Possible topics
Focus on “recent” papers (from roughly the last 20 years) in Theoretical Computer Science. Such papers typically appear in conferences such as: STOC (Symposium on Theory of Computing), FOCS (Symposium on Foundations of Computer Science), SODA (Symposium on Discrete Algorithms), ICALP (International Colloquium on Automata, Languages and Programming), ITCS (Innovations in Theoretical Computer Science), and others. It is also ok for the papers to be from theoretical Machine Learning, from conferences such as COLT (Conference On Learning Theory), NeurIPS, ICML (International Conference on Machine Learning), as long as the papers are theoretical (e.g., give guarantees on runtime and correctness/performance of proposed algorithms). Common tools for searching for papers include: Google Scholar, DBLP, authors’ homepages.
For an implementation project, think about which algorithms you want to implement, which datasets to use, and what your main hypotheses are. Your datasets would ideally have a few million points (the bigger the dataset, the clearer comparative studies will be), and be either from a public source (I can help with getting them) or generated from your research (e.g., word2vec vectors, or some other deep-learned representation). Some good resources are: https://archive.ics.uci.edu/ml/datasets.html and http://www.image-net.org/.
For a theory-based project, consider: if you are planning to (try to) extend the results, why/how do you think this may be possible? (In the project proposal, it’s obviously hard to be too specific, but you need to show understanding of what it would take to make progress on the problem.)
Below is a mix of examples/ideas for different kinds of projects, grouped by class module. Disclaimer: this is by no means a uniform sampling of topics, but rather (heavily) biased toward topics I know (of) better.
While an item may include one to a few papers, you may want to look up other related papers, either among the papers cited or among papers that cite it (using Google Scholar).
Generic resources (uncategorized):
• The website http://sublinear.info lists a number of open questions in the area of sublinear algorithms, which includes sublinear space (sketching/streaming) problems.
• A semester-long program on “Foundations of Data Science” at the Simons Institute for the Theory of Computing (explore the talks in the workshops, in particular):
https://simons.berkeley.edu/programs/datascience2018
• Similarly, a program on “Data Structures and Optimization for Fast Algorithms”:
https://simons.berkeley.edu/programs/data-structures-optimization-fast-algorithms/workshops (see, e.g., the workshops on “Sketching and Algorithm Design” and “Dynamic Graphs and Algorithm Design”).
• Workshop on Theoretical Perspectives on LLMs:
https://encore.ucsd.edu/workshop-on-theoretical-perspectives-on-llms/
• Algorithms with predictions: a paradigm for many standard problems in which the algorithm has access to a good-but-not-perfect predictor (think of an ML system that can discover trends useful for the algorithm, though the prediction may be wrong); a minimal illustration appears after this list. Here’s a workshop on the topic (see also the work by the people on the list): https://theory.stanford.edu/~sergei/stoc2022alps.html [CL15].
• Theoretical ML (not quite the topic of the class, but some of you may be interested): see the publications of, say, Daniel Hsu, Sanjeev Arora, Ankur Moitra. (Again, remember the focus is on algorithms and their efficiency, with provable guarantees on correctness/performance.)
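For concreteness, here is a minimal Python sketch of the predictions idea on the simplest possible problem, searching a sorted array (an illustration I am adding, not taken from the cited papers): exponential search around a predicted position uses O(log η) comparisons when the prediction is off by η, so a good predictor beats plain binary search, while a bad one costs only a constant factor more.

import bisect

def search_with_prediction(arr, target, predicted_idx):
    # Grow a window [lo, hi] around the prediction until it brackets target;
    # this takes O(log eta) doublings, where eta is the prediction error.
    n = len(arr)
    lo = hi = max(0, min(predicted_idx, n - 1))
    step = 1
    while lo > 0 and arr[lo] > target:
        lo = max(0, lo - step)
        step *= 2
    step = 1
    while hi < n - 1 and arr[hi] < target:
        hi = min(n - 1, hi + step)
        step *= 2
    # Ordinary binary search inside the (small, if the prediction was good) window.
    return bisect.bisect_left(arr, target, lo, hi + 1)

arr = list(range(0, 1000, 2))  # sorted even numbers; 500 sits at index 250
print(search_with_prediction(arr, 500, predicted_idx=248))  # -> 250 (eta = 2)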
Parallel algorithms:
• MapReduce/parallel algorithms for shortest paths/graph spanners [BDG+ 20, ASZ19, Li19].
• MapReduce/parallel algorithms for approximate maximum flow problem [AKL+ 24].
• MapReduce graph connectivity: [BDE+ 19, ASW19, ASS+ 18].
• Other MapReduce graph algorithms: [CLM+ 18, ABB+ 19, AK18].
• Parallel algorithms (via sketching/compressing): [BR17] or [ANOY14, IMS17] and references therein.
• Relation between parallel algorithms and transformers [SHT24b, SHT24a]. Some directions: prove tighter bounds; show empirical success for other problems beyond k-hop (where a parallel algorithm is more efficient).
Linear programming:
• Faster (fastest) linear programming algorithms: [JSWZ21, CLS19], as well as [LS14, LS15]. SDP: [HJS+ 22]. Possible directions: improving the concrete data structures used there.
• Faster LPs for packing and covering problems: [AO15] and references therein.
• Speeding-up stochastic gradient descent: [All17, AZ17].
Graph algorithms:
• Faster max-flow algorithms: [KLS20, LS19, Mad16]. Most recent (near-optimal): [CKL+ 22].
• Min-cost flow: also [CKL+ 22].
• Other related graph problems: [CMSV17, MST15].
• Approximate near-linear-time max-flow algorithms: [KPSW19, Pen16].
• Random walks: [AI16], [AKM+ 19] (small space), [LMOS19] (in parallel).
• Uncapacitated min-cost flow: [She17, Li19, ASZ19]. Directions: simplify the algorithms; improve bounds for some steps (even if this doesn't improve the overall algorithm), for example for the low-hop emulators.
Spectral graph algorithms:
• Spectral theory for directed graphs: [CKK+ 18], and [BLSS20, CKP+ 16b, CKP+ 16a].
• Linear system (of equations) solvers for Laplacians: [KS16] and references therein such as [KOSA13], or [CKM+ 14].
• Spectral sparsification algorithms: [KPPS16] and references therein.
• Higher-order spectral gap, Cheeger's inequality: [KLL+ 13, LGT14].
• Expanders, small-set expansion: [GT14, GT13].
Nearest Neighbor Search/similarity search:
• Using LSH for kernel density estimation (a problem in ML): [CKNS20] and references therein.
• Data-dependent Locality Sensitive Hashing: in [AR15], data-dependent LSH obtains the fastest (approximate) NNS algorithms for ℓ1, ℓ2 spaces; see [AR15, ANRW17]. A simpler (but sub-optimal) algorithm is [ARS17].
Directions: develop a simpler algorithm (à la [ARS17]) that also obtains (near-)optimal exponents.
• NNS for general norms/distances [ANN+ 18a, ANN+ 18b]; a simpler approach is in [KNT21, AC21]. Directions: 1) while the data structure has efficient space and query time, it has exponential-time preprocessing (even for concrete spaces such as Schatten-p); can one get polynomial-time preprocessing? 2) can we use the approach from [ANN+ 18a] to obtain improved NNS algorithms for edit distance or Earth-Mover Distance (or other spaces)? For example, can we obtain O(log d) approximation for NNS under edit distance? 3) can one get O(log d) approximation for a general norm, while also obtaining query time truly subpolynomial in n?
• Related to the above: average sketching [BBM+ 24]. Many directions: average sketching bounds for other distances/settings (even for special distributions, e.g., uniformly random)? What about 1-way or 2-way communication protocols (where Alice and Bob communicate to solve the problem)? Other reasonable variants of the definitions?
• Approximate closest pair problem: we have a set of n vectors v_i ∈ {±1}^d, and the goal is to find a pair with ‖v_i − v_j‖ ≤ t, under the promise that ‖v_i − v_j‖ ≥ t(1 + ε) for all other pairs i ≠ j.
This problem can be solved using LSH, which gives a solution with runtime n^{2−Θ(ε)}. In [Val15], Valiant showed that, in the random case, we can do better. Let d ≥ log n. Suppose all vectors v_i are chosen iid at random, except for a pair v_{i*}, v_{j*} which are correlated: ‖v_{i*} − v_{j*}‖ ≤ ε. Valiant shows one can solve the problem in time O(n^{1.6} (d/ε)^{O(1)}). The key ingredient is fast matrix multiplication (e.g., Strassen’s algorithm); a toy illustration of why matrix multiplication helps appears after this list.
See also: [Val15, KKK16, ACW16], also perhaps [CW16]. A recent paper combining matrix multiplication and LSH: [AZ23]. Directions: improve the result here by finding some new tensors.
Directions: assuming that matrix multiplication can be done in n^{2+o(1)} time, is it possible to solve a (random instance of the) closest pair problem in time n^{1+o(1)} (improving over [ACW16])?
• Nearest neighbor search for dimension d = Θ(log n). In general, there is a lower bound strongly suggesting it’s very hard, if not impossible, to get LSH exponent ρ < 1/(2c² − 1) whenever d = ω(log n) [AR16, ALRW17]. However, if the dimension is proportional to log n, one can obtain exponent ρ < 1/(2c² − 1) [BDGL16].
Directions: Is it possible to obtain even better exponents? Alternatively, what is the limit of this method (i.e., is there a lower bound in the spirit of those two papers)?
• Directions: rigorous analysis for graph-based NNS (an alternative approach to NNS that is very popular in practice; see https://big-ann-benchmarks.com/neurips23.html). See this recent paper for inspiration: [IX23].
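To see why matrix multiplication is relevant to the closest-pair item above, here is a toy Python sketch (my illustration, not Valiant's algorithm, which is considerably more subtle): for ±1 vectors, ‖v_i − v_j‖² = 2d − 2⟨v_i, v_j⟩, so the closest pair corresponds to the largest off-diagonal entry of the Gram matrix V Vᵀ, which a single matrix multiplication computes for all pairs at once.

import numpy as np

# Toy instance: iid random +/-1 vectors with one planted correlated pair.
rng = np.random.default_rng(0)
n, d = 500, 64
V = rng.choice([-1, 1], size=(n, d))
V[1] = V[0]           # plant a pair: row 1 starts as a copy of row 0 ...
V[1, :4] = -V[1, :4]  # ... then flip a few coordinates

# ||v_i - v_j||^2 = 2d - 2<v_i, v_j>, so the closest pair maximizes G[i, j].
G = V @ V.T              # all n^2 inner products via one matrix product (BLAS)
np.fill_diagonal(G, -d)  # exclude self-pairs
i, j = np.unravel_index(np.argmax(G), G.shape)
print(i, j)              # recovers the planted pair (0, 1)

(Brute force also amounts to n² inner products; the point is that fast matrix multiplication algorithms compute them faster than one by one, the effect Valiant exploits much more cleverly.)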
Streaming/Sketching:
• Streaming algorithms for computing MST: [CCAJ+ 23, CJLW22].
• Frequency moment estimation: [BZ24].
• Latest and best on approx counting (Morris’ algorithm): [NY20].
• Better sketching for heavy hitters and related problems, for insertion-only streams: [BCIW16, BKSV14].
• Sketching for specialized metrics. Example 1: Earth-Mover distance (aka transportation, or Wasserstein, metric) and related metrics. See, e.g., [CJLW20] and references therein.
Directions: is there a sketch of polylogarithmic size and approximation for the Wasserstein-2 metric over the 2D plane (a version of the Earth-Mover Distance; W-2 is common in vision)? See details and a lower bound in [ANN16]. A related question is Wasserstein-p, for any p > 1, including p = ∞. A related direction here is even just computing this distance between two strings for p > 1; the best algorithm is actually to compute a spanner and run the fastest max-flow algorithm; see [].
• Sketching of the shift metric. For two strings x, y ∈ {0, 1}^d, define sh(x, y) = min_{j∈[d]} ‖σ_j(x) − y‖_1, where σ_j(x) is the cyclic rotation of x by j positions. E.g., sh(00011, 11100) = 1 (rotate twice, and then change a 0 into a 1). This distance is a simpler version of the classic edit distance and has been used as a “testbed” for the more complicated edit distance; a brute-force reference implementation appears after this list. See [Wai20, AIK08, AHIK13, KN06]. Directions: constant approximation in smaller (polylog?) space. Perhaps average sketching?
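As a baseline for sketching experiments, here is a brute-force Python reference implementation of the shift metric defined in the last item (a sanity-check sketch I am adding; O(d²) time, so for small inputs only):

def shift_distance(x: str, y: str) -> int:
    # sh(x, y): minimum over cyclic rotations j of the Hamming distance
    # between sigma_j(x) and y.
    assert len(x) == len(y)
    d = len(x)
    return min(
        sum(a != b for a, b in zip(x[j:] + x[:j], y))  # rotate x by j, compare
        for j in range(d)
    )

print(shift_distance("00011", "11100"))  # 1, matching the example above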
Hashing algorithms:
• Linear probing: [BK24] (a minimal refresher sketch appears after this list).
• [AKK+ 19]
• Tabulation Hashing: [Tho13, PT12, AT19]
• Consistent Hashing: [MTZ18] (and see references).
• One of the best (last) words on the dictionary problem: [Yu19] (and see references).
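For those who want to experiment with the hashing items above, here is a minimal Python refresher sketch of linear probing, the scheme the [BK24] item refers to (my illustration; deletions, resizing, and the choice of hash function, where much of the interesting theory lives, are omitted):

class LinearProbingTable:
    def __init__(self, capacity=16):
        self.slots = [None] * capacity  # each slot: None or a (key, value) pair

    def _probe(self, key):
        # Probe sequence: start at hash(key) mod m, scan forward cyclically.
        m = len(self.slots)
        start = hash(key) % m
        for k in range(m):
            yield (start + k) % m

    def insert(self, key, value):
        for idx in self._probe(key):
            if self.slots[idx] is None or self.slots[idx][0] == key:
                self.slots[idx] = (key, value)
                return
        raise RuntimeError("table full; a real implementation would resize")

    def lookup(self, key):
        for idx in self._probe(key):
            if self.slots[idx] is None:  # first empty slot ends the search
                return None
            if self.slots[idx][0] == key:
                return self.slots[idx][1]
        return None

t = LinearProbingTable()
t.insert("a", 1); t.insert("b", 2)
print(t.lookup("a"), t.lookup("c"))  # -> 1 None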
Other:
• 3SUM problem with preprocessing: [GGH+ 19].