Syllabus / Class Information

What is this class about?

Big data is everywhere. A fundamental goal across numerous modern businesses and sciences is to be able to utilize as many machines as possible, to consume as much information as possible and as fast as possible. The big challenge is "how to turn data into useful knowledge". This is a moving target as both the underlying hardware and our ability to collect data evolve. In this class, we will discuss how to design big data systems, data structures and algorithms for key data-driven areas, including distributed systems, parallel systems, storage systems, machine learning and neural networks such as Large Language Models (LLMs). We will see how they all rely on the same set of very basic concepts and we will learn how to synthesize efficient solutions for any problem across these areas using those basic concepts.

What is a big data system?

Data systems are literally everywhere. We use them directly or indirectly all day, every day, for numerous tasks, basic or not so basic, e.g., from buying coffee to booking airplane tickets or training a neural network. They provide the backbone of all modern businesses to manage their data, and of course they provide the backbone of online businesses and environments such as search engines. They are also used increasingly in science, as data analytics becomes more and more the fundamental barrier in generating knowledge using Generative AI.

What is this class NOT about?

This class is not a traditional introduction to how to use a database system or how to write Python code. Instead, this is a software systems class about scalable data systems and their applications. You will learn how big data systems work at their core and how to design new systems for emerging data-driven applications and hardware. By the way, if you know how systems work, you also become better at using them!

Why take this class?

Data is everywhere. Every year we create even more data. As it stands, every two days we create as much data as we created from the dawn of humanity up to 2003. Sciences, businesses, and everyday life are substantially affected. Data systems are in the middle of all this. Data systems are how we store and access data, i.e., they are the backbone of any data-driven application. It is a $100B industry, growing 10% every year [Economist, “Data, data everywhere”]. At the same time, big data systems research and the whole industry are going through a major and continuous transition; given that new data-driven scenarios and applications continuously pop up, there is a continuous need to redefine what a good data system design is in such a dynamic environment. DS5110 exposes students to the core internals of big data systems, making it possible to understand core trends in system design and to be one of the few who know how to design and evaluate systems. In addition, due to the way the course is taught (focus on interactive problem solving and the latest research results), this is also a great class for those who want to understand what Data Science/Computer Science research is all about and how to engage in doing research.

Expected Learning Outcome

  • Learn state-of-the-art research and industry trends in big data systems.
  • Understand the tradeoffs in designing and implementing modern big data systems.
  • Be able to make design decisions in big data-driven scenarios.
  • Understand the fundamental principles that govern all systems and how these apply across diverse areas: Distributed and Parallel Systems, Storage Systems, Time-Series Deep Learning and Large Language Models (LLMs), Data Analytics and predictions.
  • Develop basic research skills: reading, writing and understanding research papers.
  • Deepen programming, debugging, and performance profiling skills.

Class Philosophy

DS5110 has unlimited office hours and unlimited late days for project deliverables. It relies on the latest research papers instead of a standard textbook, lectures are based on interaction and discussion instead of just “lecturing”, and many of the assignments and problem sets are actually open research problems. Most of all, it is fun! The instructor and the TAs are here to help you every day and at all times throughout the semester. You may request as many meetings as you like and as much help as you want. The course is also geared towards engaging creative thinking and problem solving to give students a feeling of how computer science research takes place. Many of our students in the past have successfully engaged in research projects with the UVA MLSys lab and published research papers.

How much work is it?

You may have heard stories about Data Engineering and be wondering if DS5110 is going to be equally hard, or you may have taken DS 6001 Data Engineering and be wondering if this is going to be a similar amount of work. DS5110 is much more focused on design and implementation leading to an end-to-end pipeline or system prototype. In other words, in DS5110 you are more likely to experiment with alternative approaches in order to find new ways to solve a specific problem.

Lectures

This class meets once a week.

Interaction in Every Class

While the instructor will do a few lectures through the semester, the class is going to be primarily discussion based. Think of this as an extended brainstorming session, a round table discussion about a specific problem in each class. The goal is to create the maximum possible interaction. Our discussion will aim at bringing up design trends and tradeoffs, as well as algorithmic issues. Another significant part of our discussions will focus on examining open problems and highlighting opportunities for innovation. At the very beginning of the semester, students will work on hands-on programming using the AWS Academy platform and materials (e.g., videos and labs). The instructor will do 4-5 lectures to provide the necessary background and advanced topics. After that, each class will be based on a student team presentation about their project, which will work as a trigger for the day’s brainstorming. Depending on the needs of the class, we can schedule a meeting with individual project teams.

Office Hours & Labs

Interaction does not stop at lecture time. DS5110 is designed to maximize interaction as we truly believe this is the best way to learn; we offer a TEAM Channel, office hours, and labs.

Starting Week 1, Prof. Fox and TAs will hold office hours during the week and additional times periodically during the weekend. Labs are offered by the TAs. The goal of labs is to get hands-on help for the projects (coding). Bring your questions about specific project parts you need help with. Labs are the place to go when you have a persistent bug, when you need help with a specific tool for the project (e.g., for debugging or performance testing) or to get feedback about the quality of your coding.

Attendance

Based on the philosophy of the course, attendance in lectures and labs is required. Office hours are optional. The best way to learn, though, is through discussion and interaction with the instructor and the TAs. Our classes are not about “lecturing” - they are about interaction. We hope to see you there!

Lecture Recordings

All interactive sessions in class will be recorded and will be available online. So even if you miss a class it will be easy to catch up, and you can also use these recordings to revisit specific material throughout the semester (e.g., to prepare for midterms).

Sections

Another component of the course is sections. Sections are used to deliver material about the class, i.e., to go more deeply into some of the concepts discussed in class, or to deliver background material that is needed to follow next week’s class or for the project. There will be no actual section meeting. Instead, all modules will be recorded by the teaching staff and videos will be posted online. The material posted will be tailored to present a step-by-step guide for any of the topics presented, to make it easy to follow everything without having to be physically present in an actual section. However, if there are still questions about the material presented in modules, you will be able to ask those questions during the daily office hours or labs.

Research Sessions

Throughout the semester, on select days the instructor and DASlab PhDs and postdocs will discuss research! First, DASlab researchers will present their recent work on data systems research and connect it with the material you are learning in class. Then, you will get the chance to talk with them about their research and open problems, and be exposed to open research opportunities. Snacks and drinks will be provided.

Weekly Reviews

Each student will provide book chapter reviews every week. This prepares you for the discussion in class. Reviews should be no more than two pages long. Each review should have text for at least the following 8 points:

  • What is the problem?
  • Why is it important?
  • Why is it hard?
  • Why do existing solutions not work?
  • What is the core intuition for the solution?
  • Does the paper prove its claims?
  • What is the setup of analysis/experiments? Is it sufficient?
  • What are next steps?

Reviews will not be graded; you will use

Presentations

Each student will participate in a presentation during the semester. Presentations should consist of slides and coding demonstrations for the semester project. In addition, there should be detailed slides that describe the core idea of the project with references.

Your slides should not be multiple sheets of bullet lists. They should follow the generic formatting, that is: make slides as simple as possible; avoid text unless absolutely needed; use no full phrases unless you need to give an exact definition of something; use figures and visual examples; and follow “one slide, one message”, i.e., each slide should have a single goal that you should be able to describe within a single phrase.

Your slides should be reviewed by the instructor at least 24 hours before the class you are presenting. The final deck of slides should be available 30 minutes before class so we can upload it online.

Who should take this class?

If you have taken DS5010 Programming for Data Science, DS 5111 Data Engineering, and DS6013 Deep Learning, skip this next part. Otherwise:

Background: Naturally, the more background you have the smoother your experience in DS5110 will be. Prior knowledge of Python programming, as well as a good understanding of software systems and in particular parallel and distributed systems, is very important for this class. Courses providing systems background (like DS5111 or equivalent) are essential. Good coding, algorithm, and data structure skills are also required.

If you are a graduate student and have taken a mix of machine learning systems (machine learning, and data engineering) classes in the past, then you will be OK and we will provide enough background so you can follow.

How can I do great in DS5110?

Just utilize all resources provided. Show up in class to participate in interactive sessions. There are also daily office hours and labs; show up as often as possible so we can help with anything you need! When you find yourself stuck with the project either with a design decision or just a bug, it is normal to struggle for a while — it is part of the learning process — but after some time grab your laptop and come by!

What can I do to prepare?

It is a good idea to spend some time preparing before the semester starts and during the early weeks of the semester even if you consider yourself an expert systems student. The best approach is to browse some fundamental readings in data systems architectures. We propose that you take a look at the following texts from the readings:

1. Get familiar with the very basics of traditional database architectures: Architecture of a Database System. By J. Hellerstein, M. Stonebraker and J. Hamilton. Foundations and Trends in Databases, 2007

2. Get familiar with the very basics of modern database architectures: The Design and Implementation of Modern Column-store Database Systems. By D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden. Foundations and Trends in Databases, 2017

3. Get familiar with the very basics of distributed systems, deep learning and high performance computing:

  • Distributed Systems: Principles and Paradigms, Andrew S. Tanenbaum et al. (2nd Edition) Prentice Hall Publishers, NJ 07458, USA.
  • The Deep Learning textbook (MIT Press) by Ian Goodfellow, Yoshua Bengio and Aaron Courville
  • Distributed Systems and Cloud Computing: From Parallel Processing to the Internet of Things, Kai Hwang et al., Morgan Kaufmann Publishers, an imprint of Elsevier, Inc., Burlington, MA 01803, USA.
  • High Performance Computing, Kevin Dowd & Charles Severance, O'Reilly Publisher, CA 95472, USA.

Grading Scale

  • Class Participation: 20%
  • Labs: 20%
  • Proposal: 0%
  • Midway Check-In: 30%
  • Semester Project: 30%
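
As a worked example of how these weights combine, here is a minimal sketch. It assumes each component is scored on a 0-100 scale and that the final grade is the plain weighted average; the function name and the sample scores are hypothetical and only illustrate the arithmetic.

```python
# Weights in percent, copied from the grading scale above.
WEIGHTS = {
    "participation": 20,
    "labs": 20,
    "proposal": 0,
    "midway_check_in": 30,
    "semester_project": 30,
}


def final_grade(scores):
    """Weighted average of component scores, each assumed to be on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores for one student.
example = {
    "participation": 90,
    "labs": 80,
    "proposal": 100,
    "midway_check_in": 70,
    "semester_project": 85,
}
print(final_grade(example))  # (1800 + 1600 + 0 + 2100 + 2550) / 100 = 80.5
```

Note that the proposal carries no weight, so it does not change the result whatever its score.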

Semester Project

Each student completes a semester project. There are two kinds of semester projects:

  • a systems project, and
  • a research project

Systems projects are tailored to provide background on state-of-the-art systems, data structures and algorithms. They include a design component and an implementation component in Python or Java, dealing with low-level systems issues such as memory management, hardware-conscious processing, parallel processing, managing read/write tradeoffs and scalability. This year’s systems project is about designing and implementing a key-value store in the form of a Log-Structured Merge (LSM) tree that can accommodate fast reads and writes.
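
To give a feeling for the read/write tradeoff behind a log-structured key-value store, here is a minimal in-memory sketch, assuming the project refers to a log-structured merge (LSM) tree. It is not the project specification: the class name `TinyLSM`, the `memtable_limit` parameter and the single-level compaction are illustrative assumptions, and a real system would persist runs to disk and add structures such as Bloom filters.

```python
import bisect

TOMBSTONE = object()  # marker written in place of a deleted key


class TinyLSM:
    """Toy LSM-tree: a mutable memtable plus immutable sorted runs (hypothetical API)."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit
        self.memtable = {}  # newest writes, absorbed in O(1)
        self.runs = []      # list of (sorted_keys, values), newest run first

    def put(self, key, value):
        # Writes only touch the memtable, which keeps them fast.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        # Deletes are writes too: a tombstone shadows older values.
        self.put(key, TOMBSTONE)

    def flush(self):
        # Turn the memtable into an immutable sorted run.
        if self.memtable:
            keys = sorted(self.memtable)
            self.runs.insert(0, (keys, [self.memtable[k] for k in keys]))
            self.memtable = {}

    def get(self, key, default=None):
        # Reads check the memtable, then runs from newest to oldest.
        if key in self.memtable:
            value = self.memtable[key]
        else:
            value = TOMBSTONE
            for keys, values in self.runs:
                i = bisect.bisect_left(keys, key)
                if i < len(keys) and keys[i] == key:
                    value = values[i]
                    break
        return default if value is TOMBSTONE else value

    def compact(self):
        # Merge all runs into one; newer entries win, tombstones are dropped.
        merged = {}
        for keys, values in reversed(self.runs):  # oldest first
            merged.update(zip(keys, values))
        keys = sorted(k for k, v in merged.items() if v is not TOMBSTONE)
        self.runs = [(keys, [merged[k] for k in keys])] if keys else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # memtable is full, so it is flushed to a run
db.put("a", 10)
db.delete("b")   # second flush; the tombstone shadows the old "b"
db.put("c", 3)
print(db.get("a"), db.get("b"), db.get("c"))
```

The more runs accumulate, the more places a read may have to look, which is why compaction trades extra write work for faster reads.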

Research projects, on the other hand, are much more tailored to design and proof-of-concept implementations that try to solve open problems. They are meant to give students a taste of research or development. When working on a research project, students will work closely with the instructor and members of the MLSys lab on active research projects of the lab. Students will work in groups of three. Such projects are mainly about thinking, reading and writing and much less about coding, although proof-of-concept implementations will be our end target in some cases. This year we will be working on the following research projects: 1) Self-designing NoSQL Key-value stores, 2) The Periodic Table of Data Structures, 3) Fast Neural Networks, 4) Adaptive Blockchain, and 5) Image Storage for AI.

Systems projects can evolve into research projects throughout the semester for students who progress fast and want to continue with research.

What is a successful project? For systems projects we will give out specific functionality and performance metrics you have to achieve as part of the description of the project. For research projects we will give out specific questions you need to answer when we set up each individual research project.

Evaluation: There are no final or midterm exams. At the end of the semester each student will have a meeting with the instructor and another meeting with the TAs where students will demonstrate their projects and answer design questions about the project. [Tip: Past experience shows that frequent participation in office hours, brainstorming sessions and modules means that the instructor and the TAs are very well aware of your system and your progress, which makes the final evaluation a mere formality for these cases.]

Collaboration policy: The lab is an individual assignment: the deliverable should be personal, you must write the code of your system and all documentation and reports. Discussing the design and implementation problems with other students is allowed and encouraged! We will do so in the class and during office hours and brainstorming sessions. The projects are going to be in groups of three and we encourage discussions across teams but in the end each team should deliver a project that is clearly theirs.

Late days policy: All projects are due at the end of the semester and this is when they will be graded. The more input you give us through the semester, though, the more we can help you learn. In the systems project description you can find a detailed time schedule that we propose you follow. Similarly, we will set up specific timelines for each research project. All timelines represent an ideal plan and you have the freedom to adjust according to your schedule.

There are no late days for reviews. This is because reviews are essential for you to follow each class.

Note: Experience says that every year a number of students cannot handle the freedom to self-pace, and end up significantly deviating from the schedule. We will send you frequent reminders but you should know that deviating from the schedule by more than a couple of weeks will most likely mean that you will not be able to finish the whole project by the end of the semester (unless you are an experienced systems student).

Midway Check-in: The goal here is to demonstrate that you are making decent progress and mainly to avoid falling behind. By early March each student working on a systems project should deliver 1) a design document and 2) one performance experiment that demonstrates an early result (10%). A template of the expected design document will be provided early in the semester.

Required Textbook

The class is about state-of-the-art data system design. There are reference books for that. We also use recent research papers and surveys which will be posted on the course website.

Feedback on Progress

We welcome feedback and ideas about the course at any point during the semester. Just come and chat with us during office hours! Tell us how you are keeping up and how we can make it easier for you. We provide feedback continuously. The main thing that you will need feedback on is your semester project and the paper reviews. The way to get feedback is to show up to our office hours and labs and share your design decisions, code, and test results with the staff or ask us to go with you over your paper reviews. In this way, you will get hands-on help and feedback. Specifically for reviews we will hold a special session every second week to “review the reviews”.

Online Discussion

We will use Piazza Discussions as a forum for online discussions. The links are posted on the class website. You are welcome to post any question that might help you understand the material better or help you with the project. Anonymous posting (to other students) will be enabled so that students feel more comfortable posting.

BASIC RULES: We only have a few basic rules so we can keep the forum functional and useful for the students as well as manageable for the staff.

  • We ask that you first search the forum well before posting a question so that we do not have duplicate entries.
  • Please make sure to stay on top of all Instructor/TA posts (especially those that are pinned). Anything we post on the forum we consider “known”.
  • Do not use the forum to post code or ask for help with debugging. While it can work in some cases, remote debugging over a forum is a pain and takes a lot of time. We have labs every week. Join remotely and we will help you via a shared screen mode.
  • Do not use the forum for anything that is not about a technical question or a question about class logistics. If you want to discuss any concerns about your progress, fit for the class, or anything else you should come to office hours.

Plagiarism

You are responsible for understanding University of Virginia and School of Data Science policies on academic integrity and how to use sources responsibly.

Accessibility

The University of Virginia is committed to providing an accessible academic community. The Disability Services Office offers a variety of accommodations. For more information, do not hesitate to contact Prof. Idreos directly, by email, with any questions or concerns you might have.

Assessment

DS5110 is a heavily hands-on course that is structured in a very different way than other classes, valuing and promoting critical thinking. For most students this requires a transition phase. Please check the syllabus and requirements carefully before committing to this course. In addition, keep in mind that taking this course successfully will in practice require participation in Lab sessions. They are critical for students to understand how to think about the material and how to design solutions. Especially if you do not have all the background described in the syllabus (i.e., if you have not taken a research-oriented systems course with a systems project), you should budget time for frequent participation in both Labs and office hours and many hours of additional work every week to build the foundations needed.

Lecture: Lectures will be broadcast live. Lectures will also be available for on-demand viewing within 24 hours after each class. Students will be able to watch the live or recorded broadcast through their browser. The link to the broadcasts for DS5110 will be available through the AWS Academy platform or the course website for this class.

Participation: Students will be able to participate live in classes, office hours and labs via web-conference tools (we will use Zoom). The course staff will be online with Zoom during each session that is marked as “remote” and you will be able to actively interact with the staff. Other than standard chatting and talking features Zoom also offers screen sharing features which can be used for when you need help with specific issues such as debugging.

Grading: Even though we encourage students to utilize the opportunity to interact with the staff and participate in class live, we know that for practical reasons it will not be possible to answer all questions in class. For this reason the rest of the course relies on self-learning, with Piazza as the communication tool. The final grade breakdown is as follows: Class Participation (20%), Labs (20%), Midway Check-In (30%), and Semester Project (30%).

Discussion Forum: The class forum is hosted on Piazza; please look at the class website for the forum link.

Office Hours and Labs: The schedule will be posted at the beginning of the semester on the class website and the forum.