I am working on a project that requires me to extract the names and positions of Python packages installed with the pip install command.
A web page contains a code element that holds multiple lines of text and bash commands. I want to write JavaScript code that parses this text and finds the packages and their positions in it.
For example, if the text is:
    $ pip install numpy
    pip install --global-option build_ext -t ../ pandas>=1.0.0,<2
    sudo apt update
    pip uninstall numpy
    pip install "requests==12.2.2"
I want to get results similar to this:
    [
      { "name": "numpy", "position": 14 },
      { "name": "pandas", "position": 65 },
      { "name": "requests", "position": 131 }
    ]
How do I implement this functionality in JavaScript?
P粉773659687 · 2023-09-08 15:53:55
You can see the code I explained in this answer.
Here is another similar solution, more based on regular expressions:
    // pip install options that are followed by a value.
    const pipOptionsWithArg = [
      '-c', '--constraint', '-e', '--editable', '-t', '--target',
      '--platform', '--python-version', '--implementation', '--abi',
      '--root', '--prefix', '-b', '--build', '--src', '--upgrade-strategy',
      '--install-option', '--global-option', '--no-binary', '--only-binary',
      '--progress-bar', '-i', '--index-url', '--extra-index-url',
      '-f', '--find-links', '--log', '--proxy', '--retries', '--timeout',
      '--exists-action', '--trusted-host', '--cert', '--client-cert', '--cache-dir',
    ];

    // Zero or more " --option value" or " --option=value" pairs.
    // The backslash is doubled because this is a template literal, not a regex literal.
    const optionWithArgRegex = `( (${pipOptionsWithArg.join('|')})(=| )\\S+)*`;
    // Zero or more flags that take no value.
    const options = /( -[-\w=]+)*/;
    const packageArea = /["']?(?<package_part>(?<package_name>\w[\w.-]*)([=<>~!]=?[\w.,<>]+)?)["']?(?=\s|$)/g;
    const repeatedPackages = `(?<packages>( ${packageArea.source})+)`;
    const whiteSpace = / +/;
    // Every literal space in the pattern becomes " +" so that runs of spaces also match.
    const PIP_COMMAND_REGEX = new RegExp(
      `(?<command>pip install${optionWithArgRegex}${options.source})${repeatedPackages}`
        .replaceAll(' ', whiteSpace.source),
      'g'
    );

    export const parseCommand = (command) => {
      const matches = Array.from(command.matchAll(PIP_COMMAND_REGEX));

      const results = matches.flatMap((match) => {
        const packagesStr = match?.groups.packages;
        if (!packagesStr) return [];

        // Position of the package list inside the full text.
        const packagesIndex = command.indexOf(packagesStr, match.index + match.groups.command.length);

        return Array.from(packagesStr.matchAll(packageArea))
          .map((packageMatch) => {
            const packagePart = packageMatch.groups.package_part;
            const name = packageMatch.groups.package_name;

            const startIndex = packagesIndex + packagesStr.indexOf(packagePart, packageMatch.index);
            const endIndex = startIndex + packagePart.length;

            return {
              type: 'pypi',
              name,
              version: undefined,
              startIndex,
              endIndex,
            };
          })
          .filter((result) => result.name !== 'requirements.txt');
      });

      return results;
    };
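The solution above assembles one large pattern from smaller ones. Below is a minimal sketch of that composition technique. The names `namePart`, `versionPart`, `requirement`, and `optionValue` are hypothetical, and the patterns are simplified compared with the ones in the answer.

```javascript
// Small building blocks, written as regex literals so escapes stay readable.
const namePart = /(?<name>\w[\w.-]*)/;
const versionPart = /(?<version>[=<>~!]=?[\w.,<>]+)?/;

// .source returns the pattern text without the slashes or flags,
// so the parts can be concatenated into a larger expression.
const requirement = new RegExp(namePart.source + versionPart.source, 'g');

// Inside a string or template literal a backslash must be doubled ('\\S+');
// a single '\S' would silently turn into a plain 'S'.
const optionValue = new RegExp('--target(=| +)\\S+');

const found = Array.from('numpy pandas>=1.0.0'.matchAll(requirement)).map((m) => ({
  name: m.groups.name,
  version: m.groups.version,
  index: m.index,
}));

console.log(found);
console.log(optionValue.test('pip install --target ../ numpy')); // → true
```

Here `found` should hold two entries, `numpy` at index 0 with no version and `pandas` at index 6 with version `>=1.0.0`.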
P粉194541072 · 2023-09-08 10:50:01
Here is an alternative solution that uses a loop instead of one large regular expression.
The idea is to find the lines that contain the text pip install, since those are the only lines we are interested in. Then split the command into words and loop over them until you reach the package part of the command.
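As a rough sketch of that first step, the snippet below keeps only the lines containing pip install and splits what follows into words. The helper name `wordsAfterPipInstall` and the sample text are assumptions made for illustration. They are not part of the answer's final code.

```javascript
// Hypothetical helper: returns the words that follow "pip install" on each matching line.
const wordsAfterPipInstall = (text) =>
  text
    .split('\n')
    .map((line) => ({ line, match: line.match(/pip +install/) }))
    .filter(({ match }) => match !== null)
    .map(({ line, match }) =>
      line
        .slice(match.index + match[0].length) // drop everything up to "pip install"
        .split(' ')
        .filter((word) => word !== '') // ignore empty strings left by repeated spaces
    );

const sample = '$ pip install numpy\nsudo apt update\npip uninstall numpy\npip install -t ../ pandas';
console.log(wordsAfterPipInstall(sample));
// → [ [ 'numpy' ], [ '-t', '../', 'pandas' ] ]
```

Note that "pip uninstall numpy" is skipped because the pattern requires the word install directly after pip.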
First, we define a regular expression for a package. Remember that a package can appear in forms such as pip install 'stevedore>=1.3.0,<1.4.0' "MySQL_python==1.2.2":
const packageArea = /(?<=\s|^)["']?(?<package_part>(?<package_name>\w[\w.-]*)([=<>~!]=?[\w.,<>]+)?)["']?(?=\s|$)/;
NOTE: The named group package_part identifies the "package with version" string, and the named group package_name extracts the package name.
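To make the two named groups concrete, here is a small sketch that applies the same pattern to single words. The helper `describeWord` is hypothetical and exists only for this illustration.

```javascript
const packageArea =
  /(?<=\s|^)["']?(?<package_part>(?<package_name>\w[\w.-]*)([=<>~!]=?[\w.,<>]+)?)["']?(?=\s|$)/;

// Hypothetical helper: returns the two named groups, or null when the word is not a package.
const describeWord = (word) => {
  const match = word.match(packageArea);
  if (!match) return null;
  return { part: match.groups.package_part, name: match.groups.package_name };
};

console.log(describeWord("'stevedore>=1.3.0,<1.4.0'"));
// → { part: 'stevedore>=1.3.0,<1.4.0', name: 'stevedore' }
console.log(describeWord('"MySQL_python==1.2.2"'));
// → { part: 'MySQL_python==1.2.2', name: 'MySQL_python' }
console.log(describeWord('../'));
// → null (a path is not a package specifier)
```

The surrounding quotes are matched but left outside package_part, so the reported position later points at the package text itself.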
We have two types of command-line arguments: options, which take a value, and flags, which do not.
The problem with options is that we need to know that the next word is not a package name but the option's value.
So I first listed all the options of the pip install command that take a value:
    const pipOptionsWithArg = [
      '-c', '--constraint', '-e', '--editable', '-t', '--target',
      '--platform', '--python-version', '--implementation', '--abi',
      '--root', '--prefix', '-b', '--build', '--src', '--upgrade-strategy',
      '--install-option', '--global-option', '--no-binary', '--only-binary',
      '--progress-bar', '-i', '--index-url', '--extra-index-url',
      '-f', '--find-links', '--log', '--proxy', '--retries', '--timeout',
      '--exists-action', '--trusted-host', '--cert', '--client-cert', '--cache-dir',
    ];
I then wrote a function, used later by the parser, that decides what to do when it sees an argument:
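    // Returns how many characters of the line the argument (and its value, if any) occupies.
    const handleArgument = (argument, restCommandWords) => {
      let index = 0;
      index += argument.length + 1; // +1 for the space removed by split

      // A requirements file: everything left on the line is skipped.
      if (argument === '-r' || argument === '--requirement') {
        while (restCommandWords.length > 0) {
          index += restCommandWords.shift().length + 1;
        }
        return index;
      }

      // "--option=something": the value is part of the same word.
      if (argument.includes('=')) return index;

      // A flag without a value.
      if (!pipOptionsWithArg.includes(argument)) {
        return index;
      }

      // "--option something": the next word is the value, so consume it too.
      const value = restCommandWords.shift();
      if (value !== undefined) index += value.length + 1;
      return index;
    };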
This function receives the recognized argument and the rest of the command, split into words.
(Here you start to see the "index counter". Since we also need the position of each package we find, we have to keep track of the current position in the original text.)
In the function you can see that I handle both cases: --option=something and --option something.
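The difference between the two forms can be reduced to a question of how many words an argument occupies. The sketch below shows this with a hypothetical helper `wordsConsumed` and an assumed, shortened list `optionsWithValue`. It illustrates the idea and is not a replacement for handleArgument.

```javascript
// Assumed subset of the options that take a value; the full list is pipOptionsWithArg.
const optionsWithValue = ['-t', '--target', '-i', '--index-url'];

// Hypothetical helper: how many words does this argument occupy?
const wordsConsumed = (argument) => {
  // "--target=../" carries its value in the same word.
  if (argument.includes('=')) return 1;
  // "--target ../" takes the following word as its value.
  if (optionsWithValue.includes(argument)) return 2;
  // Anything else, such as "--upgrade", is a flag with no value.
  return 1;
};

console.log(wordsConsumed('--target=../')); // → 1
console.log(wordsConsumed('--target'));     // → 2
console.log(wordsConsumed('--upgrade'));    // → 1
```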
Now the main parser splits the raw text into lines and then each line into words.
Every step must update the global index so that we always know where we are in the text. This index lets us search the text without landing on the wrong substring, by calling indexOf(str, counterIndex):
    export const parseCommand = (multilineCommand) => {
      const packages = [];
      let counterIndex = 0;

      const lines = multilineCommand.split('\n');
      while (lines.length > 0) {
        const line = lines.shift();

        const pipInstallMatch = line.match(/pip +install/);
        if (!pipInstallMatch) {
          counterIndex += line.length + 1; // +1 for the newline character
          continue;
        }

        // Skip everything up to and including "pip install".
        const pipInstallLength = pipInstallMatch.index + pipInstallMatch[0].length;
        const argsAndPackagesWords = line.slice(pipInstallLength).split(' ');
        counterIndex += pipInstallLength;

        while (argsAndPackagesWords.length > 0) {
          const word = argsAndPackagesWords.shift();

          // An empty word stands for a single space.
          if (!word) {
            counterIndex++;
            continue;
          }

          if (word.startsWith('-')) {
            counterIndex += handleArgument(word, argsAndPackagesWords);
            continue;
          }

          const packageMatch = word.match(packageArea);
          if (!packageMatch) {
            counterIndex += word.length + 1;
            continue;
          }

          // Search from counterIndex so that earlier occurrences are ignored.
          const startIndex = multilineCommand.indexOf(packageMatch.groups.package_part, counterIndex);
          packages.push({
            type: 'pypi',
            name: packageMatch.groups.package_name,
            version: undefined,
            startIndex,
            endIndex: startIndex + packageMatch.groups.package_part.length,
          });

          counterIndex += word.length + 1;
        }
      }

      return packages;
    };
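The second argument of indexOf matters when the same package name appears more than once in the text. A small illustration, using a made-up two-line input and a hypothetical variable `secondLineStart`:

```javascript
const text = 'pip uninstall numpy\npip install numpy';

// Without a start position, indexOf finds the first "numpy",
// which belongs to the uninstall command.
console.log(text.indexOf('numpy')); // → 14

// Starting from the second line (length of the first line + 1 for '\n')
// finds the occurrence that follows "pip install".
const secondLineStart = text.indexOf('\n') + 1;
console.log(text.indexOf('numpy', secondLineStart)); // → 32
```

This is why the parser advances counterIndex for every word it skips: the search then always begins at the word currently being examined.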