A command-line SEO web scraping / analysis tool
- Run `git clone git://github.com/wheresmyjetpack/scrapeo.git`
- `cd` into the scrapeo directory and run `make deploy` to install the required packages into a virtualenv
- Optional (with superuser privileges): `ln -s $HOME/.virtualenvs/venv/bin/scrapeo /usr/local/bin/scrapeo` (or link it from somewhere else in your PATH); the full sequence is sketched below
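Putting these steps together, a minimal end-to-end install might look like the following sketch. The virtualenv path `$HOME/.virtualenvs/venv` is taken from the symlink step above and may differ depending on how `make deploy` is configured on your system.

```sh
# Clone the repository and enter it
git clone git://github.com/wheresmyjetpack/scrapeo.git
cd scrapeo

# Create a virtualenv and install the required packages into it
make deploy

# Optional: link the installed entry point onto your PATH
# (assumes the virtualenv was created at $HOME/.virtualenvs/venv)
sudo ln -s $HOME/.virtualenvs/venv/bin/scrapeo /usr/local/bin/scrapeo
```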
Alternative -- Install via pip
- Simply run `pip install scrapeo` to install from the Python Package Index
- OR clone the repo, `cd` into the newly created directory, and run `pip install .`
- It's recommended that you install scrapeo in a virtualenv instead of in your global site-packages directory; a sketch of that approach follows
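For instance, assuming Python 3's built-in `venv` module is available, an isolated install might look like this (the environment path is arbitrary):

```sh
# Create and activate an isolated environment
python3 -m venv ~/.virtualenvs/scrapeo
source ~/.virtualenvs/scrapeo/bin/activate

# Install from the Python Package Index...
pip install scrapeo

# ...or from a local clone of the repository
# pip install .
```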
- Scrape and analyze elements like meta data and content from web pages
- Provide a quick and easy-to-use tool for those who prefer command-line interfaces
- Provide useful analytical and assessment data
- Installation via `pip` or `make`
- Scrape pages from the command line for meta tags by attribute-value pairs or by a single attribute's value
- Useful shortcuts like `-d` to get a page's meta description, or `-c` to retrieve a canonical URL (example invocations below)
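The exact command syntax may vary between versions, so the following invocations of the `-d` and `-c` shortcuts are an illustrative sketch that assumes the target URL is passed as a positional argument:

```sh
# Print the page's meta description
scrapeo -d https://example.com

# Print the page's canonical URL
scrapeo -c https://example.com
```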
- Makefile for common development tasks, like building wheel, source, and deb packages
  - `make test` - run all tests
  - `make deb` - build a Debian package (requires the system packages listed in requirements-dev.txt)
  - `make source` - build a source tarball
  - `make wheel` - build a Python wheel
  - `make daily` - make a daily snapshot
  - `make deploy` - create a virtual environment and install
  - `make install` - install the program
  - `make init` - install all requirements
  - `make clean` - clean the project, removing .pyc and other temporary files
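A typical development loop built from these targets might look like the sketch below; each target's exact behavior is defined in the project's Makefile.

```sh
make init    # install all requirements
make test    # run the test suite
make wheel   # build a Python wheel
make clean   # remove .pyc and other temporary files
```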
```
|-- docs
| |-- build
| | |-- doctrees
| | `-- text
| |   `-- index.txt
| |-- Makefile
| `-- source
|   |-- conf.py
|   `-- index.rst
|-- scrapeo
| |-- __init__.py
| |-- utils
| | |-- __init__.py
| | `-- web_scraper.py
| |-- core.py
| |-- exceptions.py
| |-- main.py
| `-- helpers.py
|-- tests
| |-- data
| | `-- document.html
| |-- __init__.py
| |-- test_helpers.py
| `-- test_Scrapeo.py
|-- CHANGES.txt
|-- LICENSE.txt
|-- MANIFEST.in
|-- Makefile
|-- README.md
|-- README.rst
|-- requirements-dev.txt
|-- requirements.txt
`-- setup.py
```
- Move from Python's html.parser to the external `html5lib` package to help deal with different forms of empty tags, e.g. `<meta>` and `<meta />`
- Docs (generated using Sphinx and autodoc)
- Python 2 compatibility
- `-c` canonical link option added
- `-s` option for specifying what element attribute to scrape a value from
- `-r` flag for scraping the content attribute of a robots meta tag
- `-H` option for scraping the text from the first heading by type (h1, h2, h3, etc.)
- Numerous bug fixes
- Test coverage
- Initial development release