Heritrix

Heritrix: Contents

1 Heritrix

2 ARC files

3 Tools for processing ARC files

4 Projects that use Heritrix

5 References

Heritrix

Heritrix is a web crawler: a program that traverses the web and retrieves the resources it finds. It is open source and written entirely in Java. Its configuration interface is accessible through a web browser, which makes it versatile and convenient to use, although it can also be launched from the command line.

Heritrix was developed jointly by the Internet Archive and the Nordic national libraries, starting in early 2003. The first version was released in January 2004, and it has been updated continuously since then by Internet Archive staff and third parties.

ARC files

By default, Heritrix stores the web resources it crawls in an ARC file. The Internet Archive has used the ARC format since 1996 to store its web archives.

An ARC file stores multiple resources in a single file, which avoids having to manage a large number of small files. The file consists of a sequence of URL records. Each record begins with a header containing metadata about how the resource was requested, followed by the HTTP header and the response.

Example:

    filedesc://IA-2006062.arc 0.0.0.0 20060622190110 text/plain 76
    1 1 InternetArchive
    URL IP-address Archive-date Content-type Archive-length

    http://foo.edu:80/hello.html 127.10.100.2 19961104142103 text/html 187
    HTTP/1.1 200 OK
    Date: Thu, 22 Jun 2006 19:01:15 GMT
    Server: Apache
    Last-Modified: Sat, 10 Jun 2006 22:33:11 GMT
    Content-Length: 30
    Content-Type: text/html

    <html>
    Hello World!!!
    </html>
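The header line of a URL record can be split into the five fields named in the example's field list (URL, IP-address, Archive-date, Content-type, Archive-length). The following Python sketch illustrates this. The names `ArcRecordHeader` and `parse_arc_header` are hypothetical and not part of Heritrix, and the sketch assumes that the fields are separated by single spaces and that no field contains a space.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcRecordHeader:
    """Hypothetical container for the five fields of a URL record header."""
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    archive_length: int


def parse_arc_header(line: str) -> ArcRecordHeader:
    # Assumption: exactly five space-separated fields, in the order given by
    # the field list "URL IP-address Archive-date Content-type Archive-length".
    url, ip_address, date, content_type, length = line.strip().split(" ")
    return ArcRecordHeader(
        url=url,
        ip_address=ip_address,
        # The date in the example is a 14-digit YYYYMMDDhhmmss timestamp.
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        archive_length=int(length),
    )


header = parse_arc_header(
    "http://foo.edu:80/hello.html 127.10.100.2 19961104142103 text/html 187"
)
print(header.url, header.archive_date.isoformat(), header.archive_length)
```

Here `archive_length` is the number of bytes that follow the header line, which a reader needs in order to find where the record ends.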

Tools for processing ARC files

Heritrix includes a command-line tool called arcreader, which can be used to extract the contents of an ARC file. The following command lists all the URLs and metadata stored in the ARC file:

arcreader IA-2006062.arc

The following command extracts hello.html from the example ARC file above, assuming that the record starts at offset 140:

arcreader -o 140 -f dump IA-2006062.arc
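To show where such a record offset comes from, the following Python sketch walks an ARC stream and reports the byte position at which each record starts. It is an illustration only, not how arcreader is implemented. The function `iter_arc_records` is hypothetical, and the sketch assumes an uncompressed file in which each record is a header line, then exactly Archive-length bytes of content, then a blank separator line.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def iter_arc_records(stream: BinaryIO) -> Iterator[Tuple[int, str, bytes]]:
    """Yield (offset, header_line, body) for each record in an uncompressed ARC stream."""
    while True:
        offset = stream.tell()
        header = stream.readline()
        if not header:
            break  # end of stream
        if header.strip() == b"":
            continue  # blank separator line between records
        # Assumption: the last header field is the body length in bytes.
        length = int(header.split()[-1])
        body = stream.read(length)
        yield offset, header.decode("utf-8", "replace").rstrip("\r\n"), body


# A small in-memory sample modelled on the example record; the lengths are
# computed here rather than copied from the article.
version_block = (
    b"1 1 InternetArchive\n"
    b"URL IP-address Archive-date Content-type Archive-length\n"
)
page = (
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
    b"<html>\nHello World!!!\n</html>\n"
)
sample = (
    b"filedesc://IA-2006062.arc 0.0.0.0 20060622190110 text/plain %d\n" % len(version_block)
    + version_block + b"\n"
    + b"http://foo.edu:80/hello.html 127.10.100.2 19961104142103 text/html %d\n" % len(page)
    + page + b"\n"
)

for offset, header, body in iter_arc_records(io.BytesIO(sample)):
    print(offset, header.split()[0], len(body))
```

The offset yielded for the hello.html record plays the same role as the value passed to `-o` in the command above.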

Other tools:

Arc processing tools

WERA (Web ARchive Access)

Projects that use Heritrix

Patrimoni Digital de Catalunya, experiences from the first year

Buscanding

References

Burner, M. (1997). "Crawling towards eternity – building an archive of the World Wide Web". Web Techniques 2 (5). http://www.webtechniques.com/archives/1997/05/burner/.

http://crawler.archive.org/
