There is no OS-specific download anymore. JoBo is aimed at Java developers, so I will not provide support for setting up a Java runtime environment on your workstation.
New to Java? Have a look at Sun's Java homepage!
There is a file <code class="filename">jobo.xml</code> in your JoBo directory. Editing this file allows you to modify the default behaviour of JoBo. You should not use a text editor for this unless you are sure what you are doing. The best tool is an XML editor that can validate against a DTD (e.g. Merlot, or look at XMLSoftware.com for other editors). That way you can be sure that JoBo will accept your XML document.
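For orientation, here is a rough sketch of how such a configuration could be organized. The section names <code>JoBoBase</code> and <code>WebRobot</code> are mentioned later in this document; the exact nesting and all other element names are assumptions, so check the DTD shipped with JoBo:

```xml
<!-- sketch only: nesting and element names other than
     JoBoBase and WebRobot are placeholders -->
<JoBoBase>
  <LocalizeLinks>true</LocalizeLinks>
  <WebRobot>
    <!-- robot settings such as download rules go here -->
    <DownloadRule allow="true" mimeType="*/*"/>
  </WebRobot>
</JoBoBase>
```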
<DownloadRule allow="false" minSize="0" maxSize="1024" mimeType="image/*"/>
<DownloadRule allow="true" mimeType="*/*"/>
<RegExpRule allow="true" pattern="." />
Defines the AgentName: header in the HTTP request. Usually there is no need to change this setting.
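If you do want to change it, the entry could look like this. The element name <code>AgentName</code> is assumed from the header name, and the value is only an example:

```xml
<!-- assumed element name; sent as the agent name in each HTTP request -->
<AgentName>JoBo (http://www.matuschek.net)</AgentName>
```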
JoBo sends HTTP Referer headers while walking through web sites. Using this setting you can define the Referer: header for the first document that will be retrieved.
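A possible entry, with a hypothetical element name and example value:

```xml
<!-- hypothetical element name; Referer: sent with the very first request -->
<StartReferer>http://www.matuschek.net/</StartReferer>
```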
If you set this to <code>true</code>, JoBo will ignore the <code class="filename">robots.txt</code> file on the web site you want to download. You should not do this! Why did I include this feature? Because JoBo is open source, anybody who wants to steal web sites could easily remove the robots.txt check in the source. It is also useful if you use JoBo to mirror your own sites.
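A sketch of the entry, with a hypothetical element name:

```xml
<!-- hypothetical element name; true skips the robots.txt check -->
<IgnoreRobotsTxt>false</IgnoreRobotsTxt>
```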
After every retrieved document, JoBo delays before retrieving the next one, because we don't want to put too much load on the destination web site. If you have a very low bandwidth connection (e.g. 33k dialup), you can decrease this value (even to 0).
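A sketch of the entry; the element name and the unit of the value are assumptions, so check the DTD:

```xml
<!-- hypothetical element name; delay between two retrievals -->
<SleepTime>1</SleepTime>
```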
Defines the maximum search depth of JoBo. JoBo will only download files that are at most that many clicks away from the starting page. Don't set this to very large values, because it is also used to detect loops. Usually JoBo will detect loops (it will not retrieve the same URL twice), but some web servers use special rewrite rules that allow endless URLs like /a/a/a/a/a/a/a/a/a/a/a/a ...
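A sketch of the entry, with a hypothetical element name and an example depth:

```xml
<!-- hypothetical element name; maximum number of clicks from the start page -->
<MaxDepth>5</MaxDepth>
```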
Setting this to <code>true</code> will allow JoBo to travel to any other web site that is linked from the current site. Only change this setting if you know what you are doing!
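A sketch of the entry; the element name here is purely a placeholder:

```xml
<!-- hypothetical element name; true lets JoBo leave the start site -->
<WalkToOtherSites>false</WalkToOtherSites>
```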
By default, JoBo will download all URLs from the start host. That means, if your start URL is www.matuschek.net/photo, it will also download www.matuschek.net and www.matuschek.net/papers if it finds a link to these URLs somewhere.
This can cause problems on big web servers, therefore you can turn off this behavior with <code><AllowWholeHost>false</AllowWholeHost></code>. In this case JoBo will only collect documents below the start URL, e.g. if your start URL is www.matuschek.net/photo, it is allowed to collect www.matuschek.net/photo/index.html and www.matuschek.net/photo/pics.html but not www.matuschek.net
Note that if you start with /something, JoBo will also download /other, because it can't figure out whether /something is a file or a directory. It assumes that everything without a trailing slash is a file and everything with a slash at the end is a directory.
By default, JoBo does not travel to hosts other than the start host. But some companies use different web server names. If you want to download everything from the Apache group, you will need to grab www.apache.org, java.apache.org, jakarta.apache.org, xml.apache.org and others. For this purpose you can set <code>AllowWholeDomain</code> to true. If you start at www.apache.org, JoBo is allowed to walk to all web servers in the same domain (apache.org in this example).
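The element name <code>AllowWholeDomain</code> is given above; the entry should therefore look something like this:

```xml
<!-- starting at www.apache.org, also allow java.apache.org,
     jakarta.apache.org, xml.apache.org, ... -->
<AllowWholeDomain>true</AllowWholeDomain>
```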
Some web servers use an inconsistent addressing scheme and use the host name for the web server with and without the prefix "www." (e.g. www.sourceforge.net and sourceforge.net).
Setting <code>FlexibleHostCheck</code> to true makes JoBo treat the host name with and without the leading "www." as the same host, so it is allowed to download from both.
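Using the element name given above, the entry should look something like this:

```xml
<!-- treat sourceforge.net and www.sourceforge.net as the same host -->
<FlexibleHostCheck>true</FlexibleHostCheck>
```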
Set this if you need to use an HTTP proxy. The syntax of the proxy setting is proxyname:port.
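A sketch of the entry; the element name and the proxy host are placeholders, but the value follows the proxyname:port syntax given above:

```xml
<!-- hypothetical element name; value is proxyname:port -->
<Proxy>proxy.example.com:3128</Proxy>
```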
Using this tag you can limit the bandwidth used by JoBo. This could be interesting if you don't want JoBo to use all of your available internet bandwidth.
The value is defined in bytes per second. Setting it to 0 disables bandwidth limitation.
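A sketch of the entry, with a hypothetical element name; 10240 bytes per second corresponds to roughly 10 KB/s:

```xml
<!-- hypothetical element name; bytes per second, 0 = unlimited -->
<Bandwidth>10240</Bandwidth>
```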
With this setting you can limit the set of files that will be downloaded: JoBo will only download files that were modified within this number of seconds. In the GUI you can set this in days, but due to the internal architecture you need to set it in seconds in the XML file.
Note that this limit can introduce new problems, e.g. if the start page was not modified within the given time period, JoBo will not download anything, even if there are pages on the server that were modified. It makes sense for web sites that are completely dynamic (all dynamic pages will be retrieved) and that include e.g. images that do not change very often.
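Since the XML file takes seconds, convert from days first: 7 days = 7 × 24 × 3600 = 604800 seconds. The element name below is a hypothetical placeholder:

```xml
<!-- hypothetical element name; 604800 seconds = 7 days -->
<MaxDocumentAge>604800</MaxDocumentAge>
```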
Usually JoBo will not travel to web sites other than the starting site. Using this directive you can allow JoBo to travel to certain other sites (but not arbitrary ones). JoBo is allowed to travel to all URLs that start with the given URL.
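A sketch of the entry; the element name is a placeholder, and the URL is only an example prefix:

```xml
<!-- hypothetical element name; JoBo may follow any URL
     that starts with this prefix -->
<AllowedURL>http://www.matuschek.net/jobo/</AllowedURL>
```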
Usually every URL is visited only once during a run. But on some web sites, URLs created by CGIs change their content (e.g. based on the referer or the current time). In this case it can be useful to visit a URL more than once.
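A sketch of the entry, with a hypothetical element name and example URL:

```xml
<!-- hypothetical element name; allow this URL to be visited
     more than once during a run -->
<VisitMany>http://www.example.com/cgi-bin/news.cgi</VisitMany>
```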
By default, JoBo will postprocess all retrieved HTML documents and try to replace the included links with localized versions that also work on your local hard disk. You can turn off this behaviour with the <code><LocalizeLinks>false</LocalizeLinks></code> option.
Note that this option belongs to the JoBoBase section in the XML configuration, not in the WebRobot section.
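In context, that placement could look like this (the section name <code>JoBoBase</code> and the element name <code>LocalizeLinks</code> are given above; the rest of the section is omitted):

```xml
<JoBoBase>
  <!-- LocalizeLinks belongs in JoBoBase, not in WebRobot -->
  <LocalizeLinks>false</LocalizeLinks>
  <!-- other JoBoBase settings -->
</JoBoBase>
```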
<FormField name="user" value="daniel"/>
<FormField name="pass" value="test"/>
Note that the URL you define in the form handler is not the URL of the HTML page where the form is located, but the URL of the script that is defined in the action attribute of the form. This allows you to use a single form handler if different forms refer to the same action URL. This happens on many sites without a central login page, where the user can log in on different pages.
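Putting it together, a form handler entry could look like this. The wrapper element name <code>FormHandler</code>, its <code>url</code> attribute and the action URL are assumptions; the <code>FormField</code> lines are taken from the example above:

```xml
<!-- assumed wrapper element; url is the form's action URL,
     not the URL of the page containing the form -->
<FormHandler url="http://www.example.com/cgi-bin/login.cgi">
  <FormField name="user" value="daniel"/>
  <FormField name="pass" value="test"/>
</FormHandler>
```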
<DownloadRule allow="false" mimeType="application/zip" />
<DownloadRule allow="false" mimeType="*/*" minSize="100000" />
A DownloadRule has the attributes allow, mimeType, minSize, maxSize. minSize and maxSize are optional arguments.
<DownloadRule allow="false" mimeType="*/*" maxSize="10000" />
<DownloadRule allow="false" mimeType="*/*" minSize="100000" />
Note that you should not deny <code>text/html</code> documents, because they usually contain links that will not be followed if the document is not retrieved. Therefore a good rule of thumb is the following template:
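A sketch of such a template, assuming rules are checked in order as in the examples above: always allow HTML so links can be followed, then apply the restrictive rules, then decide about everything else.

```xml
<!-- always fetch HTML so links can be followed -->
<DownloadRule allow="true" mimeType="text/html"/>
<!-- deny everything else that is larger than 100000 bytes -->
<DownloadRule allow="false" mimeType="*/*" minSize="100000"/>
<!-- allow the rest -->
<DownloadRule allow="true" mimeType="*/*"/>
```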
Regular Expression URL checking
<!-- ignore GIF and JPEG files -->
<RegExpRule allow="false" pattern="\.gif$" />
<RegExpRule allow="false" pattern="\.jpg$" />
<RegExpRule allow="false" pattern="\.jpeg$" />
<!-- ignore CGI (this isn't a 100% solution because sometimes -->
<!-- it is not possible to find out what is a CGI) -->
<RegExpRule allow="false" pattern="cgi-bin" />
<RegExpRule allow="false" pattern="\.cgi$" />
<RegExpRule allow="false" pattern="\?" />
<!-- ignore ASP -->
<RegExpRule allow="false" pattern="\.asp$" />
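A rule set like the one above is typically closed with the catch-all rule shown earlier in this document, so that every URL not explicitly denied is still retrieved:

```xml
<!-- allow everything that was not denied by a rule above -->
<RegExpRule allow="true" pattern="." />
```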
This page has been archived, i.e. it is no longer actively maintained and the information may be out of date.