Data Science is the study of data. It involves collecting, analyzing, and interpreting large amounts of information. Data scientists use this information to make decisions, solve problems, and predict future trends.
Data scientists use various tools and techniques to analyze and interpret complex data sets. This helps businesses and organizations make better decisions.
If you're a beginner just starting with data science, you will probably face several challenges in setting up a proper data science environment.
Here are some reasons why setting up a data science environment can be challenging for beginners:
By understanding these challenges, beginners can better prepare themselves and seek the right resources and support to overcome them.
The initial hurdles can be challenging for new data scientists, but with persistence and consistent learning, the journey will become smoother.
Canonical's Data Science Stack (DSS) now makes setting up a data science environment much easier. In this tutorial, we will discuss what the Data Science Stack is and how to use it to set up a data science environment easily and quickly on Ubuntu.
The Data Science Stack (DSS) by Canonical is an out-of-the-box solution for data scientists and machine learning engineers.
The Data Science Stack simplifies the setup process by providing a pre-configured environment that includes all the necessary tools and libraries for machine learning and data analysis.
DSS is designed to run on Ubuntu workstations and to make use of the host's GPUs, which is particularly beneficial for computationally intensive tasks such as training machine learning models.
DSS allows users to focus more on the development and optimization of their models rather than the technicalities of the environment setup.
This can save a significant amount of time that would otherwise be spent on installing and configuring individual components.
The Data Science Stack (DSS) provides a comprehensive and integrated environment for data scientists and machine learning engineers. Here's what it offers:
Overall, DSS aims to provide a hassle-free and optimized environment for data science and machine learning, allowing users to focus on their core tasks rather than the technical setup and maintenance of their tools.
To begin using the Data Science Stack (DSS) for machine learning and data science, follow these steps to set up your environment:
DSS uses MicroK8s as its container orchestration system, which allows workloads to access the host's GPUs.
To install MicroK8s on Ubuntu, run:
$ sudo snap install microk8s --channel 1.28/stable --classic
Next, enable the required services:
$ sudo microk8s enable storage dns rbac
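Enabling the add-ons can take a moment, so it may help to wait until the cluster reports that it is ready before moving on. The sketch below wraps the microk8s status --wait-ready command in a small function. The function name wait_for_microk8s and the default timeout of 300 seconds are illustrative assumptions, not part of MicroK8s or DSS.

```shell
# Hypothetical helper (the name and the default timeout are assumptions).
# Blocks until MicroK8s reports that it is ready, or until the timeout expires.
wait_for_microk8s() {
  timeout_seconds="${1:-300}"
  sudo microk8s status --wait-ready --timeout "$timeout_seconds"
}
```

Calling wait_for_microk8s 120 would wait for up to two minutes and return a non-zero status if the cluster is still not ready.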
The Data Science Stack is managed through a Command Line Interface (CLI).
Install DSS CLI with the following command:
$ sudo snap install data-science-stack --channel latest/stable
With these steps completed, you'll have the foundational components of DSS installed and ready to use. You can now proceed to set up your machine learning environments and start running your first notebooks using the DSS CLI.
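Before continuing, you may want to confirm that both commands used in the rest of this tutorial are actually on your PATH. The following is a minimal sketch; the helper name missing_tools is hypothetical, and it only checks that the commands can be found, not that they work.

```shell
# Hypothetical helper: prints the name of each command that is NOT found on PATH.
# Prints nothing when every command is available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Running missing_tools microk8s dss should print nothing if both snaps were installed correctly. If a name is printed, check that the snap binary directory is on your PATH or open a new terminal session.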
After installing MicroK8s and the DSS CLI, the next step is to initialize DSS on top of MicroK8s and prepare MLflow for use.
To initialize DSS, you'll need to use the dss initialize command, which sets up the necessary resources within the MicroK8s cluster.
$ dss initialize --kubeconfig="$(sudo microk8s config)"
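The --kubeconfig flag passes the contents of the Kubernetes configuration generated by MicroK8s (the output of sudo microk8s config), which DSS stores for later use. Keep the quotes around the value so that the shell passes the multi-line configuration as a single argument.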
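If you plan to script this step, a small wrapper can stop early when MicroK8s does not return a configuration. This is only a sketch: the function name init_dss and its error messages are assumptions, while the two underlying commands are the ones shown above.

```shell
# Hypothetical wrapper (the name is an assumption): initialize DSS only if
# MicroK8s actually returned a kubeconfig.
init_dss() {
  kubeconfig="$(sudo microk8s config)" || {
    echo "Could not read the MicroK8s kubeconfig." >&2
    return 1
  }
  if [ -z "$kubeconfig" ]; then
    echo "MicroK8s returned an empty kubeconfig." >&2
    return 1
  fi
  # Quote the value so the multi-line content is passed as one argument.
  dss initialize --kubeconfig="$kubeconfig"
}
```

The wrapper returns a non-zero status without calling dss initialize when the configuration cannot be read, which makes failures easier to spot in a setup script.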
The dss initialize command may take a few minutes to complete. During this time, the DSS CLI will display messages indicating the progress of the deployment. You will see messages similar to the following:
[INFO] Waiting for deployment mlflow in namespace dss to be ready...
This message indicates that DSS is waiting for the MLflow deployment to become ready. Be patient as the system sets up the environment and ensures all components are correctly configured.
Once the initialization is complete, you will see output like the following:
[INFO] Executing initialize command
[INFO] Storing provided kubeconfig to /home/ostechnix/snap/data-science-stack/16/.dss/config
[INFO] Waiting for deployment mlflow in namespace dss to be ready...
[INFO] Deployment mlflow in namespace dss is ready
[INFO] DSS initialized. To create your first notebook run the command:
dss create
Examples:
  dss create my-notebook --image=pytorch
  dss create my-notebook --image=kubeflownotebookswg/jupyter-scipy:v1.8.0
Now, you will be ready to start using the MLflow tracking server and other components provided by DSS.
You can then proceed to create and run your first machine learning notebook within the DSS environment.
To start your first Jupyter Notebook using the Data Science Stack (DSS), you'll need to use the dss create command, which allows you to specify the type of notebook you want to create.
Here, we are creating a TensorFlow notebook named my-tensorflow-notebook with CUDA support:
$ dss create my-tensorflow-notebook --image=kubeflownotebookswg/jupyter-tensorflow-cuda:v1.8.0
Upon successful creation of the notebook, you will see output like the following:
[INFO] Executing create command
[INFO] Waiting for deployment my-tensorflow-notebook in namespace dss to be ready...
[INFO] Waiting for deployment my-tensorflow-notebook in namespace dss to be ready...
[INFO] Waiting for deployment my-tensorflow-notebook in namespace dss to be ready...
[INFO] Deployment my-tensorflow-notebook in namespace dss is ready
[INFO] Success: Notebook my-tensorflow-notebook created successfully.
[INFO] Access the notebook at http://10.152.183.253:80.
Once the notebook is ready, the command shows a URL that you can use to access the JupyterLab UI.
To start working with your notebook, open a web browser and enter the provided URL into the address bar.
As shown in the output above, the newly created notebook is available at http://10.152.183.253:80. Use the URL from your own output instead.
This will take you to the JupyterLab interface where you can create new notebooks, upload data, and begin your machine learning tasks using TensorFlow and CUDA.
Remember that the IP address and port number in the URL may vary depending on your specific setup.
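Because the address differs between setups, it can be convenient to pull the URL out of the command output instead of copying it by hand. The sketch below assumes the final log line has the same wording as the sample output shown earlier ("Access the notebook at ..." followed by a period). The function name extract_notebook_url is hypothetical.

```shell
# Hypothetical helper: reads `dss create` output on standard input and prints
# the URL from the "Access the notebook at <URL>." line, without the final period.
# Assumes the log wording matches the sample output in this tutorial.
extract_notebook_url() {
  sed -n 's/.*Access the notebook at \(http[^ ]*\)\.$/\1/p'
}
```

One possible use is dss create my-tensorflow-notebook --image=kubeflownotebookswg/jupyter-tensorflow-cuda:v1.8.0 2>&1 | extract_notebook_url. The 2>&1 is there because this sketch does not assume whether the CLI writes its log to standard output or standard error.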
That's it. You can now start interacting with your notebook.
To quickly check the status of your Data Science Stack (DSS) environment, including the status of MLflow and the availability of GPU acceleration, you can use the dss status command as shown below.
$ dss status
The dss status command will provide you with a summary of the current state of your DSS environment. Here's an example of what the output might look like:
[INFO] MLflow deployment: Ready
[INFO] MLflow URL: http://10.152.183.157:5000
[INFO] GPU acceleration: Disabled
Explanation of Output:
To verify, open the MLflow URL http://10.152.183.157:5000 from your web browser.
This will open the MLflow dashboard in your web browser.
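If you want to reuse values from the status summary in a script, you can read a single field from the output. The following sketch assumes each line has the form "[INFO] label: value", as in the sample output above. The function name status_field is hypothetical.

```shell
# Hypothetical helper: prints the value of one "[INFO] <label>: <value>" line.
# Reads `dss status` output on standard input; $1 is the label to look up.
# The label is used inside a sed pattern, so keep it to plain text.
status_field() {
  sed -n "s/^\[INFO\] $1: //p"
}
```

For example, dss status 2>&1 | status_field "MLflow URL" would print only the MLflow address, and status_field "GPU acceleration" would print the GPU state.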
Experiments tab in the MLflow dashboard:
Since this is a fresh installation, there are no experiments yet. To create an experiment, use the mlflow experiments CLI.
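As a rough sketch of that step, the function below creates an experiment against a given tracking server. It assumes the mlflow command-line tool is installed wherever you run it (for example with pip, or in a notebook terminal), which this tutorial does not cover. The function name create_experiment and the experiment name in the usage line are illustrative.

```shell
# Hypothetical helper (the name is an assumption).
# $1 = MLflow tracking URI, e.g. the "MLflow URL" reported by dss status
# $2 = name of the experiment to create
create_experiment() {
  tracking_uri="$1"
  experiment_name="$2"
  # MLFLOW_TRACKING_URI tells the mlflow CLI which server to talk to.
  MLFLOW_TRACKING_URI="$tracking_uri" mlflow experiments create --experiment-name "$experiment_name"
}
```

For example, create_experiment http://10.152.183.157:5000 my-first-experiment would create an experiment on the server from the sample output. Replace the URL with your own. The new experiment should then appear in the Experiments tab after a refresh.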
Models tab in MLflow Dashboard:
To view the list of available commands for the Data Science Stack (DSS), you can use the dss command with the --help option.
Run the following command in your terminal:
$ dss --help
This will display a list of commands along with a brief description of their purpose.
If you need more detailed information about a specific DSS command, you can use the command followed by the --help option.
For example, to get details about the logs command, you would run:
$ dss logs --help
If you don't need DSS anymore, you can use the dss purge command to remove the Data Science Stack from your MicroK8s cluster.
To remove DSS, execute the following command in your terminal:
$ dss purge
This command will completely remove all DSS components, including Jupyter Notebooks, the MLflow server, and any data stored within the DSS environment.
It's important to note that this action is irreversible, and all data within the DSS environment will be permanently lost. Make sure to back up any important data before proceeding with the purge.
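Since the purge cannot be undone, you might wrap it in a confirmation prompt when using it in your own scripts. This is a minimal sketch: the function name confirm_purge and the prompt wording are assumptions, and the only DSS command it runs is dss purge.

```shell
# Hypothetical wrapper (the name and prompt are assumptions).
# Runs `dss purge` only after the user types the word "purge".
confirm_purge() {
  printf 'This permanently deletes all DSS notebooks and data. Type "purge" to continue: ' >&2
  read -r answer || answer=""
  if [ "$answer" != "purge" ]; then
    echo "Aborted; nothing was removed." >&2
    return 1
  fi
  dss purge
}
```

Any other input, including an empty line, aborts with a non-zero status and leaves the environment untouched.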
While the dss purge command removes the DSS components from the MicroK8s cluster, it does not remove the DSS CLI or the MicroK8s cluster itself. If you wish to remove these as well, you will need to delete their respective snaps:
To remove the DSS CLI, use the following command:
$ sudo snap remove data-science-stack
To remove MicroK8s, use the following command:
$ sudo snap remove microk8s
By following these steps, you can completely remove the Data Science Stack (DSS) and its associated components from your system.
Q: What is the Data Science Stack (DSS)?
A: Data Science Stack (DSS) is a comprehensive, ready-to-run environment for machine learning and data science. It is designed to simplify the setup and management of data science tools and frameworks, allowing users to focus on their core tasks rather than the intricacies of environment configuration.
Q: What tools are included in DSS?
A: DSS includes a variety of open-source tools such as Jupyter Notebook, MLflow, and popular machine learning frameworks like TensorFlow and PyTorch. It also provides a container orchestration system, MicroK8s, for managing workloads.
Q: How do I install DSS?
A: To install DSS, you need to have Ubuntu 22.04 LTS or Ubuntu 24.04 LTS, an internet connection, and Snap installed. Then, you can install MicroK8s and the DSS CLI using Snap commands. For detailed instructions, refer to the official documentation or installation guide.
Q: How do I start a Jupyter Notebook with DSS?
A: You can start a Jupyter Notebook with DSS using the dss create command, specifying the desired image for your notebook. For example, to start a TensorFlow notebook, you would use dss create my-tensorflow-notebook --image=kubeflownotebookswg/jupyter-tensorflow-cuda:v1.8.0.
Q: What is the purpose of the dss status command?
A: The dss status command provides a quick overview of the current state of your DSS environment, including the status of MLflow and the availability of GPU acceleration. It helps you verify that all components are functioning correctly.
Q: How do I remove DSS from my environment?
A: To remove DSS, you can use the dss purge command, which will remove all DSS components, including Jupyter Notebooks and the MLflow server. Note that this action is irreversible and will result in the loss of all data within the DSS environment.
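Q: Where can I find more information about DSS commands?
A: You can find detailed information about DSS commands by using the dss --help command to list all available commands, and by adding --help to an individual command (for example, dss logs --help) for details about that command.
Q: Is DSS free to use?
A: Yes, DSS is based on open-source tools and is free to use.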
Q: Is DSS suitable for beginners in data science?
A: Yes, DSS is designed to be user-friendly and can be a great tool for beginners as it reduces the complexity of setting up a data science environment. It provides a ready-made and optimized environment that allows users to start working on data science projects quickly.
In summary, the Data Science Stack (DSS) simplifies the setup for data science tasks. It provides a collection of tools that work well together, making it easier to start projects quickly.
Whether you're new to data science or experienced, DSS helps you focus on your work by handling the technical setup. It's a reliable tool that supports efficient data analysis and model building.