Algolia DocSearch Scraper in Docker for GitHub Actions

Run the self-hosted Algolia DocSearch scraper in Docker with GitHub Actions.

  • A prebuilt base image keeps the GitHub Actions workflow fast (the image builds in under 1 minute) and keeps the scraper up to date automatically through Docker Hub's automated builds of the base image.
  • Config examples are available at algolia/docsearch-configs.

Usage

Basic usage:

- name: Push indices to Algolia
  uses: signcl/docsearch-scraper-action@master
  env:
    APPLICATION_ID: ${{ secrets.ALGOLIA_APPLICATION_ID }}
    API_KEY: ${{ secrets.ALGOLIA_API_KEY }}
    CONFIG: '{"index_name": "docs","start_urls": ["https://example.com/"],"sitemap_urls": ["https://example.com/sitemap.xml"],"sitemap_alternate_links": true,"stop_urls": [],"selectors": {"lvl1": "header h1","lvl2": "article h2","lvl3": "article h3","lvl4": "article h4","lvl5": "article h5, article td:first-child","lvl6": "article h6","text": "article p, article li, article td:last-child"},"strip_chars": " .,;:#","custom_settings": {"separatorsToIndex": "_","attributesForFaceting": ["language","version","type","docusaurus_tag"],"attributesToRetrieve": ["hierarchy","content","anchor","url","url_without_anchor","type"]}}'

The tricky part is passing CONFIG to the scraper. The example above won't work if your configuration contains an XPath selector such as ul[contains(@class,'menu__list')], because the single quotes inside the selector clash with the single quotes that wrap the CONFIG value in YAML. A cleaner approach is to commit your config to the repository as algolia.json and check it out within the workflow:

- uses: actions/checkout@v2

- name: Get the content of algolia.json as config
  id: algolia_config
  run: echo "config=$(jq -r tostring algolia.json)" >> "$GITHUB_OUTPUT"

- name: Push indices to Algolia
  uses: signcl/docsearch-scraper-action@master
  env:
    APPLICATION_ID: ${{ secrets.ALGOLIA_APPLICATION_ID }}
    API_KEY: ${{ secrets.ALGOLIA_API_KEY }}
    CONFIG: ${{ steps.algolia_config.outputs.config }}
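To see what the config step hands to the action, the jq call can be tried locally. The following is a minimal sketch, assuming jq is installed; the algolia.json contents, the lvl0 selector, and the output_file name are hypothetical and only there for illustration. On a runner, the step appends to the file that GitHub Actions exposes as $GITHUB_OUTPUT instead.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical config for illustration only; replace with your own algolia.json.
workdir="$(mktemp -d)"
cat > "$workdir/algolia.json" <<'EOF'
{
  "index_name": "docs",
  "start_urls": ["https://example.com/"],
  "selectors": {
    "lvl0": {
      "selector": "//ul[contains(@class,'menu__list')]//a",
      "type": "xpath",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "header h1",
    "text": "article p, article li"
  }
}
EOF

# Same jq filter as the workflow step: serialize the whole document
# as compact JSON on a single line.
config="$(jq -r tostring "$workdir/algolia.json")"

# Stand-in for $GITHUB_OUTPUT (an assumption for local runs).
output_file="$workdir/github_output"
echo "config=$config" >> "$output_file"

# Show the name=value line that later steps would read as
# steps.algolia_config.outputs.config.
cat "$output_file"
```

Because jq emits the JSON on one line, the value fits the plain name=value form of the step output, and the single quotes inside the XPath selector pass through unchanged since they are never placed inside a quoted YAML string.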