Welcome to the first post in LETSQL's tutorial series!
In this blog post, we take a detour from our usual theme of data pipelines to demonstrate how to create and publish a Python package with Poetry, using DataFusion as an example.
Harlequin is a TUI client for SQL databases, known for being lightweight while supporting a wide range of databases. It is a versatile tool for data exploration and analysis workflows. Harlequin provides an interactive SQL editor with features like autocomplete, syntax highlighting, and query history. It also has a results viewer that can display large result sets. However, Harlequin did not have a DataFusion adapter before. Thankfully, it was really easy to add one.
In this post, we'll demonstrate these concepts by building a Harlequin adapter for DataFusion. Along the way, we will also cover Poetry's essential features, project setup, and the steps to publish your package on PyPI.
To get the most out of this guide, you should have a basic understanding of virtual environments, Python packages and modules, and pip.
Our objectives are to:
By the end, you'll have practical experience with Poetry and an understanding of modern Python package management.
The code implemented in this post is available on GitHub and on PyPI.
Harlequin is a SQL IDE that runs in the terminal. It provides a powerful and feature-rich alternative to traditional command-line database tools, making it versatile for data exploration and analysis workflows.
Some key things to know about Harlequin:
DataFusion is a fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format.
DataFusion offers SQL and DataFrame APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.
It ships with its own CLI; more information can be found here.
Poetry is a modern, feature-rich tool that streamlines dependency management and packaging for Python projects, making development more deterministic and efficient.
From the documentation:
Poetry is a tool for dependency management and packaging in Python. It allows you to declare the libraries your project depends on, and it will manage (install/update) them for you.
Poetry offers a lockfile to ensure repeatable installs and can build your project for distribution.
A Harlequin adapter is a Python package that allows Harlequin to work with a database system.
An adapter is a Python package that declares an entry point in the harlequin.adapter group. That entry point should reference a subclass of the HarlequinAdapter abstract base class.
This allows Harlequin to discover installed adapters and instantiate a selected adapter at run-time.
In addition to the HarlequinAdapter class, the package must also provide implementations for HarlequinConnection and HarlequinCursor. A more detailed description can be found in this guide.
The first step for developing a Harlequin adapter is to generate a new repo from the existing harlequin-adapter-template.
GitHub templates are repositories that serve as starting points for new projects. They provide pre-configured files, structures, and settings that are copied to new repositories, allowing for quick project setup without the overhead of forking.
This feature streamlines the process of creating consistent, well-structured projects based on established patterns.
The harlequin-adapter-template comes with a poetry.lock file and a pyproject.toml file, in addition to some boilerplate code for defining the required classes.
Let's explore the essential files needed for package distribution before we get into the specifics of coding.
The pyproject.toml file is now the standard for configuring Python packages for publication and for configuring other tools. Introduced in PEP 518 and extended by PEP 621, this TOML-formatted file consolidates multiple configuration files into one. It makes dependency management more robust and standardized.
Poetry uses pyproject.toml to handle the project's virtual environment, resolve dependencies, and create packages.
The pyproject.toml of the template is as follows:
[tool.poetry]
name = "harlequin-myadapter"
version = "0.1.0"
description = "A Harlequin adapter for [...]."
authors = ["Ted Conbeer [...]"]
license = "MIT"
readme = "README.md"
packages = [
    { include = "harlequin_myadapter", from = "src" },
]

[tool.poetry.plugins."harlequin.adapter"]
my-adapter = "harlequin_myadapter:MyAdapter"

[tool.poetry.dependencies]
python = ">=3.8.1,[...]=4.6.0", python = "[...]

[...] will automatically update this section.
The [tool.poetry.dev-dependencies] subsection is where you declare development-only dependencies, like testing frameworks, linters, etc.
The [build-system] section is used to store build-related data. In this case, it specifies the build-backend as "poetry.core.masonry.api". In a narrow sense, the core responsibility of a build-backend is to build wheels and sdists.
The repository also includes a poetry.lock file, a Poetry-specific component generated by running poetry install or poetry update. This lock file specifies the exact versions of all dependencies and sub-dependencies for your project, ensuring reproducible installations across different environments.
It's crucial to avoid manual edits to the poetry.lock file, as this can cause inconsistencies and installation issues. Instead, make changes to your pyproject.toml file and allow Poetry to automatically update the lock file by running poetry lock.
Per Poetry's installation warning:
::: {.warning}
Poetry should always be installed in a dedicated virtual environment to isolate it from the rest of your system. It should in no case be installed in the environment of the project that is to be managed by Poetry.
:::
Here we will presume you have installed Poetry, for example by running pipx install poetry.
With our file structure clarified, let's begin the development process by setting up our environment. Since our project already includes pyproject.toml and poetry.lock files, we can initiate our environment using the poetry shell command.
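This command activates the virtual environment linked to the current Poetry project, ensuring all subsequent operations occur within the project's dependency context. If no virtual environment exists, poetry shell automatically creates and activates one. Note that Poetry 2.0 moved the shell command into the separate poetry-plugin-shell plugin, so newer installations need that plugin for this command to be available.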
poetry shell detects your current shell and launches a new instance within the virtual environment. As Poetry centralizes virtual environments by default, this command eliminates the need to locate or recall the specific path to the activate script.
To verify which Python environment is currently in use with Poetry, you can use the following command:
poetry env list --full-path
This will show all the virtual environments associated with your project and indicate which one is currently active.
As an alternative, you can get the full path of only the current environment:
poetry env info -p
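The same check can be made from inside Python. The following is a small sketch, not part of the adapter, that reports whether the running interpreter belongs to a virtual environment; the function names in_virtualenv and describe_environment are hypothetical.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def describe_environment() -> dict[str, object]:
    """Collect the details that identify the active interpreter."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "in_virtualenv": in_virtualenv(),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Run inside the Poetry environment, the reported prefix should match the path printed by poetry env info -p.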
With the environment activated, use poetry install to install the required dependencies. The command works as follows
To complete the environment setup, we need to add the datafusion library to our dependencies. Execute the following command:
poetry add datafusion
This command updates your pyproject.toml file with the datafusion package and installs it. If you don't specify a version, Poetry will automatically select an appropriate one based on available package versions.
To create a Harlequin Adapter, you need to implement three interfaces defined as abstract classes in the harlequin.adapter module.
The first one is the HarlequinAdapter.
class DataFusionAdapter(HarlequinAdapter):
    def __init__(self, conn_str: Sequence[str], **options: Any) -> None:
        self.conn_str = conn_str
        self.options = options

    def connect(self) -> DataFusionConnection:
        conn = DataFusionConnection(self.conn_str, self.options)
        return conn
The second one is the HarlequinConnection, particularly the methods execute and get_catalog.
def execute(self, query: str) -> HarlequinCursor | None:
    try:
        cur = self.conn.sql(query)  # type: ignore
        if str(cur.logical_plan()) == "EmptyRelation":
            return None
    except Exception as e:
        raise HarlequinQueryError(
            msg=str(e),
            title="Harlequin encountered an error while executing your query.",
        ) from e
    else:
        if cur is not None:
            return DataFusionCursor(cur)
        else:
            return None
For brevity, we've omitted the implementation of the get_catalog function. You can find the full code in the adapter.py file within our GitHub repository.
Finally, a HarlequinCursor implementation must be provided as well:
class DataFusionCursor(HarlequinCursor):
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        self.cur = args[0]
        self._limit: int | None = None

    def columns(self) -> list[tuple[str, str]]:
        return [
            (field.name, _mapping.get(field.type, "?"))
            for field in self.cur.schema()
        ]

    def set_limit(self, limit: int) -> DataFusionCursor:
        self._limit = limit
        return self

    def fetchall(self) -> AutoBackendType:
        try:
            if self._limit is None:
                return self.cur.to_arrow_table()
            else:
                return self.cur.limit(self._limit).to_arrow_table()
        except Exception as e:
            raise HarlequinQueryError(
                msg=str(e),
                title="Harlequin encountered an error while executing your query.",
            ) from e
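To see how the three pieces fit together without installing Harlequin or DataFusion, the sketch below mirrors the same adapter, connection, and cursor shape on top of the standard-library sqlite3 module. All three Toy classes are hypothetical stand-ins written for this illustration; they do not subclass Harlequin's abstract base classes and only imitate the method names used above.

```python
from __future__ import annotations

import sqlite3
from typing import Any, Sequence


class ToyCursor:
    """Stand-in for a cursor: exposes columns, set_limit, and fetchall."""

    def __init__(self, cur: sqlite3.Cursor) -> None:
        self.cur = cur
        self._limit: int | None = None

    def columns(self) -> list[tuple[str, str]]:
        # sqlite3 reports no column types here, so "?" stands in for a type label.
        return [(desc[0], "?") for desc in self.cur.description]

    def set_limit(self, limit: int) -> ToyCursor:
        self._limit = limit
        return self

    def fetchall(self) -> list[tuple[Any, ...]]:
        if self._limit is None:
            return self.cur.fetchall()
        return self.cur.fetchmany(self._limit)


class ToyConnection:
    """Stand-in for a connection: turns a query into a cursor or None."""

    def __init__(self, conn_str: Sequence[str]) -> None:
        self.conn = sqlite3.connect(conn_str[0] if conn_str else ":memory:")

    def execute(self, query: str) -> ToyCursor | None:
        cur = self.conn.execute(query)
        # Statements without a result set (DDL, inserts) have no description.
        return ToyCursor(cur) if cur.description is not None else None


class ToyAdapter:
    """Stand-in for an adapter: stores options and opens connections."""

    def __init__(self, conn_str: Sequence[str], **options: Any) -> None:
        self.conn_str = conn_str
        self.options = options

    def connect(self) -> ToyConnection:
        return ToyConnection(self.conn_str)


if __name__ == "__main__":
    conn = ToyAdapter(conn_str=[":memory:"]).connect()
    conn.execute("create table t (x integer)")
    conn.execute("insert into t values (1), (2), (3)")
    cursor = conn.execute("select x from t order by x")
    if cursor is not None:
        print(cursor.columns())
        print(cursor.set_limit(2).fetchall())
```

The point of the split is that the adapter only holds configuration, the connection owns the database handle, and the cursor defers fetching until a row limit has been set.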
Your adapter must register an entry point in the harlequin.adapter group using the packaging software you use to build your project.
If you use Poetry, you can define the entry point in your pyproject.toml file:
[tool.poetry.plugins."harlequin.adapter"]
datafusion = "harlequin_datafusion:DataFusionAdapter"
An entry point is a mechanism for code to advertise components it provides to be discovered and used by other code.
Notice that registering a plugin with Poetry is equivalent to the following pyproject.toml specification for entry points:
[project.entry-points."harlequin.adapter"]
datafusion = "harlequin_datafusion:DataFusionAdapter"
The template provides a set of pre-configured tests, some of which are applicable to DataFusion while others may not be relevant. One test that's pretty cool checks if the plugin can be discovered, which is crucial for ensuring proper integration:
if sys.version_info [...]
[...] None:
    PLUGIN_NAME = "datafusion"
    eps = entry_points(group="harlequin.adapter")
    assert eps[PLUGIN_NAME]
    adapter_cls = eps[PLUGIN_NAME].load()
    assert issubclass(adapter_cls, HarlequinAdapter)
    assert adapter_cls == DataFusionAdapter
To make sure the tests are passing, run:
poetry run pytest
The run command executes the given command inside the project’s virtualenv.
With the tests passing, we're nearly ready to publish our project. Let's enhance our pyproject.toml file to make our package more discoverable and appealing on PyPI. We'll add key metadata including:
These additions will help potential users find and understand our package more easily.
classifiers = [
    "Development Status :: 3 - Alpha",
    "Intended Audience :: Developers",
    "Topic :: Software Development :: User Interfaces",
    "Topic :: Database :: Database Engines/Servers",
    "License :: OSI Approved :: MIT License",
    "Programming Language :: Python :: Implementation :: CPython"
]
readme = "README.md"
repository = "https://github.com/mesejo/datafusion-adapter"
For reference:
We're now ready to build our library and verify its functionality by installing it in a clean virtual environment. Let's start with the build process:
poetry build
This command will create distribution packages (both source and wheel) in the dist directory.
The wheel file should have a name like harlequin_datafusion-0.1.1-py3-none-any.whl. This follows the standard naming convention:
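The wheel naming convention is {distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl, where the build tag is optional. As an illustration, the sketch below splits a wheel filename into those components; parse_wheel_filename and WheelName are hypothetical names, and the function does only the simple splitting needed for this example.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        # No build tag present.
        distribution, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        distribution, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of components in {filename!r}")
    return WheelName(distribution, version, build, python_tag, abi_tag, platform_tag)


if __name__ == "__main__":
    print(parse_wheel_filename("harlequin_datafusion-0.1.1-py3-none-any.whl"))
```

For the wheel built here, py3 means any Python 3 interpreter, none means no ABI requirement, and any means no platform restriction, which is what a pure-Python package produces.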
To test the installation, create a new directory called test_install. Then, set up a fresh virtual environment with the following command:
python -m venv .venv
To activate the virtual environment on macOS or Linux:
source .venv/bin/activate
After running this command, you should see the name of your virtual environment (.venv) prepended to your command prompt, indicating that the virtual environment is now active.
To install the wheel file we just built, use pip as follows:
pip install /path/to/harlequin_datafusion-0.1.1-py3-none-any.whl
Replace /path/to/harlequin_datafusion-0.1.1-py3-none-any.whl with the actual path to the wheel file you want to install.
If everything works fine, you should see some dependencies installed, and you should be able to run:
harlequin -a datafusion
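If you prefer to confirm the installation from Python rather than by launching the TUI, a small sketch like the following reports the installed version of a distribution. It uses only importlib.metadata; installed_version is a hypothetical helper name.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str) -> str | None:
    """Return the installed version of `distribution`, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints None unless the wheel was installed into this environment.
    print(installed_version("harlequin-datafusion"))
```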
Congrats! You have built a Python library. Now it is time to share it with the world.
The best practice before publishing to PyPI is to first publish to the Test Python Package Index (TestPyPI).
To publish a package to TestPyPI using Poetry, follow these steps:
1. Create an account at TestPyPI if you haven't already.
2. Generate an API token on your TestPyPI account page.
3. Register the TestPyPI repository with Poetry by running:
poetry config repositories.testpypi https://test.pypi.org/legacy/
4. To publish your package, run:
poetry publish -r testpypi --username __token__ --password <token>
Replace <token> with the actual token value you generated in step 2. The repository name passed to -r must match the name registered in step 3. To verify the publishing process, use the following command:
python -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple <your-package>
This command uses two key arguments: --index-url points pip at TestPyPI as the primary package index, and --extra-index-url adds PyPI as an additional index so that dependencies not hosted on TestPyPI can still be resolved.
Replace <your-package> with your specific package name (e.g., harlequin-datafusion if following this post). For additional details, consult the information provided in this post.
To publish to the actual Python Package Index (PyPI) instead:
Create an account at https://pypi.org/ if you haven't already.
Generate an API token on your PyPI account page.
Run:
poetry publish --username __token__ --password <token>
The default repository is PyPI, so there's no need to specify it.
It is worth noting that Poetry only supports the Legacy Upload API when publishing your project.
Manually publishing each time is repetitive and error-prone, so let us create a GitHub Action that publishes the package each time we create a release.
Here are the key steps to publish a Python package to PyPI using GitHub Actions and Poetry:
Set up PyPI authentication: You must provide your PyPI credentials (the API token) as a GitHub secret so the GitHub Actions workflow can access it. Name the secret something like PYPI_TOKEN.
Create a GitHub Actions workflow file: In your project's .github/workflows directory, create a new file like publish.yml with the following content:
name: Build and publish python package

on:
  release:
    types: [ published ]

jobs:
  publish-package:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install Poetry
        uses: snok/install-poetry@v1
      - run: poetry config pypi-token.pypi "${{ secrets.PYPI_TOKEN }}"
      - name: Publish package
        run: poetry publish --build --username __token__
The key is to leverage GitHub Actions to automate the publishing process and use Poetry to manage your package's dependencies and metadata.
Poetry is a user-friendly Python package management tool that simplifies project setup and publication. Its intuitive command-line interface streamlines environment management and dependency installation. It supports plugin development, integrates with other tools, and emphasizes testing for robust code. With straightforward commands for building and publishing packages, Poetry makes it easier for developers to share their work with the Python community.
At LETSQL, we're committed to contributing to the developer community. We hope this blog post serves as a straightforward guide to developing and publishing Python packages, emphasizing best practices and providing valuable resources.
To subscribe to our newsletter, visit letsql.com.
As we continue to refine the adapter, we would like to provide better autocompletion and direct reading from files (parquet, csv) as in the DataFusion-cli. This requires a tighter integration with the Rust library without going through the Python bindings.
Your thoughts and feedback are invaluable as we navigate this journey. Share your experiences, questions, or suggestions in the comments below or on our community forum. Let's redefine the boundaries of data science and machine learning integration.
Thanks to Dan Lovell and Hussain Sultan for the comments and the thorough review.