Welcome to Data analysis with Python - Summer 2021
NOTE: Please check the course practicalities, e.g., how to pass the course, schedules, and deadlines, on the official course page. This course is available until September 15, 2021 (recommended latest starting date August 1, 2021).
This course gives an overview of the different phases of the data analysis pipeline using Python and its data analysis ecosystem. What is typically done in data analysis? We assume that the data is already available, so we only need to download it. After downloading, the data needs to be cleaned to enable further analysis. In the cleaning phase the data is converted to a uniform and consistent format. After that the data can, for instance, be
combined or divided into smaller chunks,
grouped or sorted,
condensed into a small number of summary statistics, or
transformed with numerical or string operations.
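As a rough illustration of the cleaning and manipulation steps listed above, here is a minimal sketch using pandas. The table, its column names, and its values are made up for this example and are not course data.

```python
import pandas as pd

# Hypothetical raw data: inconsistent city names and one missing value.
raw = pd.DataFrame({
    "city": [" Helsinki", "helsinki", "Turku", "TURKU ", "Tampere"],
    "year": [2020, 2021, 2020, 2021, 2021],
    "temperature": [6.5, 7.1, None, 6.8, 5.9],
})

# Cleaning: bring the strings to one consistent format (a string operation)
# and drop the rows where the measurement is missing.
clean = raw.copy()
clean["city"] = clean["city"].str.strip().str.title()
clean = clean.dropna(subset=["temperature"])

# Sorting, and dividing the data into a smaller chunk.
clean = clean.sort_values(["city", "year"])
recent = clean[clean["year"] == 2021]

# Grouping, then condensing each group into summary statistics.
summary = clean.groupby("city")["temperature"].agg(["mean", "max"])

# A numerical operation applied to a whole column at once.
clean["temperature_f"] = clean["temperature"] * 9 / 5 + 32

print(summary)
```

Each step produces a new table, so intermediate results can be inspected before moving on to the next phase.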
The point is to manipulate the data into a form that enables discovery of relationships and regularities among the elements of the data. Visualizing the data often helps to understand it better. Another useful tool for data analysis is machine learning, where a mathematical or statistical model is fitted to the data. These models can then be used to make predictions about new data, or to explain or describe the current data.
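Fitting a model and using it for prediction can be sketched in a few lines. The example below assumes a simple straight-line model fitted by least squares with NumPy's `polyfit`. The observations are invented for illustration.

```python
import numpy as np

# Hypothetical observations: hours studied and exam points.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
points = np.array([12.0, 19.0, 31.0, 42.0, 48.0])

# Fit the model points = slope * hours + intercept by least squares.
slope, intercept = np.polyfit(hours, points, deg=1)

# Use the fitted model to predict a value that was not observed.
predicted = slope * 6.0 + intercept

print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={predicted:.1f}")
```

The fitted slope also describes the current data: it estimates how many points one additional hour is associated with in this made-up sample.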
Python is a popular, easy-to-learn programming language. It is commonly used in data analysis because very efficient libraries are available for processing large amounts of data. This so-called data analysis stack includes libraries such as NumPy, Pandas, Matplotlib and SciPy, which we will familiarize ourselves with during this course.
No previous knowledge of Python is needed, as we will start with a quick introduction to Python. It is, however, assumed that you have good programming skills in some language. In addition, linear algebra and probability calculus are prerequisites for this course. The course lasts for seven weeks and gives 5 credit units. It is recommended that you take this course at the end of a bachelor's degree or at the beginning of a master's degree, preferably before the course “Introduction to Data Science”.
Conduct of the course
Refer to the official course page.
Note that this course requires a lot of work! Depending on your background, it can take between 5 and 20 hours per week. In addition to reading the course material, you may sometimes need to consult the online documentation of Python or its various libraries.
Discussion forum
A Telegram chat room for the course has been opened. We recommend that you use the channel either through a web browser or the Telegram desktop application.
You can reach the channel through this link: https://t.me/tkt_dap. A browser version of Telegram is also available.
The discussion channel is based on peer support. The teachers of the course participate in the discussion on a voluntary basis if time permits.
Please refrain from posting code or stack traces as text or attachments in the chat room. Use tmc paste or another paste site instead and just post the link. Also try to avoid posting complete solutions or spoilers for exercises.
Software libraries used
Library | Documentation |
---|---|
numpy | |
pandas | |
matplotlib | |
scikit-learn | |
scipy | |
jupyter | |
- Initializing course environment
- Running Python code
- Frequently asked questions
- How to load a file that resides in the src folder?
- Tests complain about missing attribute assert_called, assert_called_once, …
- Problem when testing part01-e09
- ModuleNotFoundError: No module named ‘somelibrary’
- What version of Python should I use? What is the name of the executable?
- Stuff generally doesn’t seem to work in the conda prompt on Windows
- TMC test fails but all other commands (including submit) work
- When logging in to tmc for the first time, what is the server address it asks for?
- TMC login crashes (with mention of tmc-cli.log)
- tmc not found
- tmc test says: “Test results: 0/0 tests passed”
- I cannot understand the error message from a failed test case