JoBo documentation


Installation

There is no longer an OS-specific download. JoBo is aimed at Java developers, so I do not provide support for setting up a Java runtime environment on your workstation.

New to Java? Have a look at Sun's Java homepage!

XML configuration

There is a file <code class="filename">jobo.xml</code> in your JoBo directory. Editing this file lets you change the default behaviour of JoBo. Do not use a plain text editor for this unless you are sure what you are doing. The best tool is an XML editor that can validate against a DTD (e.g. Merlot, or look at XMLSoftware.com for other editors). Then you can be sure that JoBo will accept your XML document.

 

<JoBo>
  <Robot>
    <AgentName>JoBo</AgentName>
    <StartReferer>http://www.matuschek.net/jobo.html</StartReferer>
    <IgnoreRobotsTxt>false</IgnoreRobotsTxt>
    <SleepTime>60000</SleepTime>
    <MaxDepth>2</MaxDepth>
    <WalkToOtherHosts>false</WalkToOtherHosts>
    <Proxy>webproxy.my-provider.com:8080</Proxy>
    <Bandwidth>1024</Bandwidth>
    <MaxDocumentAge>86400</MaxDocumentAge>
    <AllowWholeHost>true</AllowWholeHost>
    <AllowWholeDomain>false</AllowWholeDomain>
    <FlexibleHostCheck>true</FlexibleHostCheck>

    <AllowedUrl>http://www.matuschek.net</AllowedUrl>

    <VisitMany>http://www.matuschek.net</VisitMany>
  </Robot>

  <DownloadRuleSet>
    <DownloadRule allow="false" minsize="0" maxsize="1024" mimeType="image/*"/>
    <DownloadRule allow="true" mimeType="*/*"/>
  </DownloadRuleSet>

  <URLCheck>
    <RegExpRule allow="true" pattern="." />
  </URLCheck>

  <LocalizeLinks>true</LocalizeLinks>

</JoBo>

<AgentName>

Defines the agent name that JoBo sends in the HTTP request (the User-Agent: header). Usually there is no need to change this setting.

<StartReferer>

JoBo sends HTTP Referer headers while walking through web sites. With this setting you can define the Referer: header for the first document that is retrieved.

<IgnoreRobotsTxt>

If you set this to <code>true</code>, JoBo will ignore the <code class="filename">robots.txt</code> file on the web site you want to download. You should not do this! Why did I include this feature? Because JoBo is open source, anybody who wants to steal web sites could easily remove the robots.txt check from the source anyway. The setting is also useful if you use JoBo to mirror your own sites.

<SleepTime>

After every retrieved document, JoBo waits before retrieving the next one, because we don't want to put much load on the destination web site. If you have a very low bandwidth connection (e.g. dialup 33k), you can decrease this value (even to 0).

<MaxDepth>

Defines the maximum search depth of JoBo. JoBo will only download files that are at most that many clicks away from the starting page. Don't set this to a very large value, because it is also used to stop loops. Usually JoBo detects loops itself (it will not retrieve the same URL twice), but some web servers use special rewrite rules that allow endless URLs like /a/a/a/a/a/a/a/a/a/a/a/a ....

<WalkToOtherHosts>

Setting this to <code>true</code> will allow JoBo to travel to any other web site that is linked from the current site. Only change this setting if you know what you are doing!

<AllowWholeHost>

By default, JoBo will download all URLs from the start host. That means that if your start URL is www.matuschek.net/photo, it will also download www.matuschek.net and www.matuschek.net/papers if it finds a link to these URLs somewhere.
This can cause problems on big web servers, so you can turn off this behaviour with <code><AllowWholeHost>false</AllowWholeHost></code>. In this case JoBo will only collect documents below the start URL, e.g. if your start URL is www.matuschek.net/photo, it is allowed to collect www.matuschek.net/photo/index.html and www.matuschek.net/photo/pics.html but not www.matuschek.net.

Note that if you start with /something, JoBo will also download /other, because it can't figure out whether /something is a file or a directory. It therefore assumes that everything without a trailing slash is a file and everything with a slash at the end is a directory.
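As a minimal sketch of the restriction described above, the fragment below turns off whole-host downloads. It only uses elements from the sample jobo.xml; the start URL mentioned in the comment is an assumed example and is not set in this file.

```xml
<Robot>
  <!-- Assumed start URL: http://www.matuschek.net/photo/
       The trailing slash marks "photo" as a directory, so only
       documents below /photo/ should be collected. -->
  <AllowWholeHost>false</AllowWholeHost>
</Robot>
```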

<AllowWholeDomain>

By default, JoBo does not travel to hosts other than the start host. But some organisations use several web server names. If you want to download everything from the Apache group, you will need to grab www.apache.org, java.apache.org, jakarta.apache.org, xml.apache.org and others. For this purpose you can set <code>AllowWholeDomain</code> to true. If you start at www.apache.org, JoBo is then allowed to walk to all web servers in the same domain (apache.org in this example).

<FlexibleHostCheck>

Some web sites use an inconsistent addressing scheme and refer to the web server both with and without the "www." prefix (e.g. www.sourceforge.net and sourceforge.net).

With <code>FlexibleHostCheck</code> set to true, JoBo treats the host name with and without the leading "www." as the same host and is allowed to download from both.

<Proxy>

Set this if you need to use an HTTP proxy. The syntax of the proxy setting is proxyname:port.

<Bandwidth>

With this tag you can limit the bandwidth used by JoBo. This is useful if you don't want JoBo to use all of your available internet bandwidth.
The value is given in bytes per second. Setting it to 0 disables the bandwidth limit.
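For illustration, a limit of 16 KiB per second would be written as 16 × 1024 = 16384. The figure is an arbitrary example, not a recommended value.

```xml
<Robot>
  <!-- 16 * 1024 = 16384 bytes per second; use 0 for no limit -->
  <Bandwidth>16384</Bandwidth>
</Robot>
```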

<MaxDocumentAge>

With this setting you can limit the number of files that will be downloaded. JoBo will only download files that were modified within the given number of seconds. In the GUI you can set this in days, but due to the internal architecture you need to set it in seconds in the XML file.
Note that this limit can cause new problems: if the start page was not modified in the given time period, JoBo will not download anything, even if there are pages on the server that were modified. The setting makes sense for web sites that are completely dynamic (all dynamic pages will be retrieved) and that include, for example, images that do not change very often.
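Because the XML file expects seconds, a value in days has to be converted by hand. As a worked example, one week is 7 × 24 × 60 × 60 = 604800 seconds (the sample value 86400 corresponds to one day):

```xml
<Robot>
  <!-- 7 days * 24 h * 60 min * 60 s = 604800 seconds -->
  <MaxDocumentAge>604800</MaxDocumentAge>
</Robot>
```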

<EnableCookies>

If true, JoBo will accept cookies from web servers, otherwise it will not use cookies.

<AllowedUrl>

Usually JoBo will not travel to web sites other than the starting site. With this directive you can allow JoBo to travel to some other sites (but not to arbitrary ones). JoBo is allowed to travel to all URLs that start with the given URL.

This setting makes sense if the web site has more than one name (e.g. www1.matuschek.net, www2.matuschek.net) and you want to download from all servers.

You can have many <AllowedUrl> statements in your jobo.xml.
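A sketch of the multi-name case mentioned above, using the two example host names from this section. Each allowed URL prefix gets its own element:

```xml
<Robot>
  <!-- JoBo may follow any URL that starts with one of these prefixes -->
  <AllowedUrl>http://www1.matuschek.net</AllowedUrl>
  <AllowedUrl>http://www2.matuschek.net</AllowedUrl>
</Robot>
```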

<VisitMany>

Usually every URL is visited only once during a run. But on some web sites, URLs served by CGIs return changing content (e.g. based on the referer or the current time). In this case it can be useful to visit a URL more than once.

<LocalizeLinks>

By default, JoBo postprocesses all retrieved HTML documents and tries to replace the links they contain with localized versions that should also work on your local hard disk. You can turn off this behaviour with the <code><LocalizeLinks>false</LocalizeLinks></code> option.
Note that this option belongs to the JoBoBase section in the XML configuration, not in the WebRobot section.

Why this XML configuration?

  1. I like it ;-)
  2. I'm not a good GUI programmer, therefore the GUI is as minimal as possible. If you want to implement a Swing GUI that lets users set these parameters, contact me.

Regular Expression URL checking

The download configuration lets you allow or deny downloads based on MIME type and document size. There is another module for denying downloads: the "Regular Expression URL check".

A RegExpRule is basically described by a regular expression (called pattern):
<code><RegExpRule allow="false" pattern="xxxx" /></code>

Note that regular expression syntax differs from the "usual" wildcards, so you write "\.gif$" instead of "*.gif". For more information about regular expressions have a look at the web (e.g. AskJeeves).

Examples:

<!-- ignore GIF and JPEG files -->
<RegExpRule allow="false" pattern="\.gif$" />
<RegExpRule allow="false" pattern="\.jpg$" />
<RegExpRule allow="false" pattern="\.jpeg$" />

<!-- ignore CGI (this isn't a 100% solution because sometimes -->
<!-- it is not possible to find out what is a CGI) -->
<RegExpRule allow="false" pattern="cgi-bin" />
<RegExpRule allow="false" pattern="\.cgi$" />
<RegExpRule allow="false" pattern="\?" />

<!-- ignore ASP -->
<RegExpRule allow="false" pattern="\.asp$" />
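In the sample jobo.xml at the top of this page, RegExpRule elements sit inside the URLCheck element. A combined configuration might therefore look like the sketch below. The ordering (deny rules first, catch-all allow rule last) is an assumption about how the rules are evaluated; this page does not state the evaluation order, so test it against your own installation.

```xml
<URLCheck>
  <!-- deny images and obvious CGI URLs -->
  <RegExpRule allow="false" pattern="\.gif$" />
  <RegExpRule allow="false" pattern="\.jpg$" />
  <RegExpRule allow="false" pattern="cgi-bin" />
  <!-- catch-all taken from the sample configuration -->
  <RegExpRule allow="true" pattern="." />
</URLCheck>
```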

Archived page

This page has been archived, i.e. it is no longer actively maintained and the information may no longer be up to date.
