adding notes on web-to-markdown
projects/web-to-markdown/index.md
---
title: "Scraping web articles for AI"
description: "Scraping web articles has never been this easy. Here are some tools to make it even easier and help feed your AI more data."
author: "wompmacho"
date: '2026-05-17T12:15:13-04:00'
lastmod: '2026-05-17'
tags: ["golang", "automation", "markdown", "gemini-cli", "skills"]
---

## Introduction

Managing knowledge often involves capturing articles and documentation from the web. To streamline this workflow, I developed `web-to-markdown`, a utility written in Go that extracts article content, converts it to clean Markdown, and downloads inline images locally (*structured to complement my Hugo site*).

It is useful for quickly grabbing and sanitizing web content so that agents have plenty of context to work with. Adding the tool as a skill complements the planning stage of a project.

## The web-to-markdown Utility

The core utility (*[Git repository](https://git.wompmacho.com/wompmacho/web-to-markdown)*) is a CLI application built with Go `1.25.0+`. It is designed to be fast and reliable, and to produce a clean output structure.

### Key Features

- **Boilerplate Removal:** The tool uses the `go-readability` library to isolate the main article content, stripping out distracting elements such as advertisements, navigation bars, and footers.
- **Concurrent Image Downloading:** Inline images are downloaded concurrently using goroutines, which significantly reduces the time required to process image-heavy articles.
- **Markdown Conversion:** The sanitized HTML is converted into readable Markdown using the `html-to-markdown` package.
- **Intelligent Output Structure:** The utility generates a flat directory structure. It saves the main article as `index.md` and renames each downloaded image based on its `alt` text, `<figcaption>`, or surrounding header context. The Markdown links are automatically rewritten to point to these local, contextualized image names.
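
The concurrent download step can be sketched roughly as follows. This is a minimal illustration of the fan-out pattern, not the tool's actual code: `downloadAll`, `fetchFunc`, `httpFetch`, and the `result` type are hypothetical names, and the URLs in `main` are placeholders. One goroutine is started per image, and a `sync.WaitGroup` waits for all of them to finish.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
	"time"
)

// result pairs an image URL with its downloaded bytes, or the error hit while fetching it.
type result struct {
	URL  string
	Data []byte
	Err  error
}

// fetchFunc abstracts the download so the fan-out logic can run without a network.
type fetchFunc func(url string) ([]byte, error)

// httpFetch downloads one URL using a client with a timeout.
func httpFetch(url string) ([]byte, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("GET %s: unexpected status %s", url, resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// downloadAll starts one goroutine per URL and waits for all of them.
// Each goroutine writes only to its own slice index, so no mutex is needed.
func downloadAll(urls []string, fetch fetchFunc) []result {
	results := make([]result, len(urls))
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			data, err := fetch(u)
			results[i] = result{URL: u, Data: data, Err: err}
		}(i, u)
	}
	wg.Wait()
	return results
}

func main() {
	// Placeholder image URLs; a real run would use the sources found in the article.
	urls := []string{"https://example.com/a.png", "https://example.com/b.jpg"}
	for _, r := range downloadAll(urls, httpFetch) {
		if r.Err != nil {
			fmt.Println("failed:", r.URL, r.Err)
			continue
		}
		fmt.Printf("fetched %s (%d bytes)\n", r.URL, len(r.Data))
	}
}
```

Passing the fetch function as a parameter keeps the fan-out logic testable without network access. A production version would probably also cap the number of simultaneous downloads.
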
### Usage Example

The utility is executed via the command line, accepting the target URL and optional flags for customization.

```sh
./web-to-markdown -title "Go Concurrency Guide" -out "./docs/go" "https://example.com/go-concurrency"
```

This command generates a structured output similar to the following:

```text
docs/go/
└── go-concurrency-guide/
    ├── index.md
    ├── diagram-of-goroutines-a1b2c3.jpg
    └── author-profile-pic-f6g7h8.png
```
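
File names like those in the sample output could be produced by something along these lines. The sketch assumes each name is a slug of the image's context text (its `alt` text, caption, or nearest heading) followed by a short suffix. Deriving that suffix from a SHA-256 hash of the source URL is my assumption, and `slugify` and `localImageName` are hypothetical helper names rather than functions from the tool.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/url"
	"path"
	"regexp"
	"strings"
)

// nonAlnum matches every run of characters that is not a lowercase letter or digit.
var nonAlnum = regexp.MustCompile(`[^a-z0-9]+`)

// slugify lowercases text, replaces each run of other characters with one hyphen,
// and trims hyphens from both ends.
func slugify(text string) string {
	return strings.Trim(nonAlnum.ReplaceAllString(strings.ToLower(text), "-"), "-")
}

// localImageName builds "<slug>-<suffix><ext>" from the context text and source URL.
func localImageName(contextText, srcURL string) string {
	slug := slugify(contextText)
	if slug == "" {
		slug = "image" // fallback when no usable context text exists
	}

	// Assumption: a short hash of the source URL keeps names unique and stable.
	sum := sha256.Sum256([]byte(srcURL))
	suffix := hex.EncodeToString(sum[:])[:6]

	// Take the extension from the URL path so query strings are ignored.
	ext := ""
	if u, err := url.Parse(srcURL); err == nil {
		ext = path.Ext(u.Path)
	}
	return slug + "-" + suffix + ext
}

func main() {
	fmt.Println(localImageName("Diagram of Goroutines", "https://example.com/img/goroutines.jpg?w=800"))
}
```

Because the suffix depends only on the source URL, re-running the conversion on the same article yields the same file names, which keeps the rewritten Markdown links stable.
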
## Extending the Agent with a Custom Skill

While the CLI tool is powerful on its own, manually executing it interrupts the creative or research flow. To solve this, I built a custom Gemini CLI skill.

Skills in Gemini CLI are modular packages that inject specialized procedural knowledge and workflows into the agent's context window.

### The web-to-markdown Skill

The custom skill instructs the Gemini CLI agent on how and when to use the local `web-to-markdown` binary.

When a user issues a request like *"Grab me a copy of `https://example.com/article`,"* the skill triggers the following automated workflow:

1. **Configuration Gathering:** The agent pauses and asks the user where the article should be saved and what the desired title should be.
2. **Execution:** Once the parameters are confirmed, the agent calls the `run_shell_command` tool to invoke the `web-to-markdown` utility with the correct flags and URL.
3. **Verification:** The agent verifies the success of the command by checking the target directory and informs the user that the Markdown file is ready for review.
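
For illustration, steps 2 and 3 amount to something like the following Go program. It is only a sketch of the equivalent logic, not what the agent runs internally: `buildArgs` and `hasIndex` are hypothetical helpers, the argument order mirrors the usage example earlier in this post, and the article directory name is assumed from the sample output tree.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// buildArgs assembles the arguments in the same order as the usage example:
// flags first, target URL last.
func buildArgs(title, outDir, url string) []string {
	return []string{"-title", title, "-out", outDir, url}
}

// hasIndex reports whether articleDir contains a non-empty index.md file.
func hasIndex(articleDir string) bool {
	info, err := os.Stat(filepath.Join(articleDir, "index.md"))
	return err == nil && !info.IsDir() && info.Size() > 0
}

func main() {
	// Step 2: run the local binary with the parameters gathered from the user.
	args := buildArgs("Go Concurrency Guide", "./docs/go", "https://example.com/go-concurrency")
	cmd := exec.Command("./web-to-markdown", args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "web-to-markdown failed:", err)
		return
	}

	// Step 3: confirm the article was written.
	// Assumption: the directory name matches the sample output tree.
	articleDir := filepath.Join("docs", "go", "go-concurrency-guide")
	if hasIndex(articleDir) {
		fmt.Println("article ready for review in", articleDir)
	} else {
		fmt.Fprintln(os.Stderr, "index.md not found in", articleDir)
	}
}
```

Checking for a non-empty `index.md` is a cheap signal that the run produced output, and it does not depend on parsing the tool's log messages.
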
### Skill Creation Process

The skill was generated using the built-in `skill-creator`. The process involved:

1. **Initialization:** Running the initialization script to scaffold the skill directory structure.
2. **Drafting Instructions:** Writing the `SKILL.md` file, which includes the YAML frontmatter (defining the trigger description) and the step-by-step workflow instructions for the agent.
3. **Packaging and Installation:** Packaging the skill into a `.skill` archive and installing it into the user's `~/.gemini/skills/` directory.