Installation
There is no longer an OS-specific download. JoBo is aimed at Java developers, so I will not provide support for setting up a Java runtime environment on your workstation.
New to Java? Have a look at Sun's Java homepage!
XML configuration
There is a file <code class="filename">jobo.xml</code> in your JoBo directory. Editing this file allows you to modify the default behaviour of JoBo. You should not use a plain text editor for this unless you are sure what you are doing. The best tool is an XML editor that can use a DTD (e.g. Merlot, or look at XMLSoftware.com for other editors). Then you can be sure that JoBo will accept your XML document.
<JoBo>
<Robot>
<AgentName>JoBo</AgentName>
<StartReferer>http://www.matuschek.net/jobo.html</StartReferer>
<IgnoreRobotsTxt>false</IgnoreRobotsTxt>
<SleepTime>60000</SleepTime>
<MaxDepth>2</MaxDepth>
<WalkToOtherHosts>false</WalkToOtherHosts>
<Proxy>webproxy.my-provider.com:8080</Proxy>
<Bandwidth>1024</Bandwidth>
<MaxDocumentAge>86400</MaxDocumentAge>
<AllowWholeHost>true</AllowWholeHost>
<AllowWholeDomain>false</AllowWholeDomain>
<FlexibleHostCheck>true</FlexibleHostCheck>
<AllowedUrl>http://www.matuschek.net</AllowedUrl>
<VisitMany>http://www.matuschek.net</VisitMany>
</Robot>
<DownloadRuleSet>
<DownloadRule allow="false" minSize="0" maxSize="1024" mimeType="image/*"/>
<DownloadRule allow="true" mimeType="*/*"/>
</DownloadRuleSet>
<URLCheck>
<RegExpRule allow="true" pattern="." />
</URLCheck>
<LocalizeLinks>true</LocalizeLinks>
</JoBo>
<AgentName>
Defines the User-Agent: header sent in the HTTP request. Usually there is no need to change this setting.
<StartReferer>
JoBo sends HTTP Referer headers while walking through web sites. With this setting you can define the Referer: header for the first document that is retrieved.
<IgnoreRobotsTxt>
If you set this to <code>true</code>, JoBo will ignore the <code class="filename">robots.txt</code> file on the web site you want to download. You should not do this! Why did I include this feature? Because JoBo is open source, anybody who wants to steal web sites could easily remove the robots.txt check from the source. It is also useful if you use JoBo to mirror your own sites.
<SleepTime>
After every retrieved document, JoBo pauses before retrieving the next one, so that it does not put much load on the destination web site. If you have a very low-bandwidth connection (e.g. 33k dial-up), you can decrease this value (even to 0).
<MaxDepth>
Defines the maximum search depth of JoBo. JoBo will only download files that are at most that many clicks away from the starting page. Don't set this to a very large value, because it is also used to detect loops. Usually JoBo will detect loops (it will not retrieve the same URL twice), but some web servers use special rewrite rules that allow endless URLs like /a/a/a/a/a/a/a/a/a/a/a/a ....
<WalkToOtherHosts>
Setting this to <code>true</code> will allow JoBo to travel to any other web site that is linked from the current site. Only change this setting if you know what you are doing!
<AllowWholeHost>
By default, JoBo will download all URLs from the start host. That means that if your start URL is www.matuschek.net/photo, it will also download www.matuschek.net and www.matuschek.net/papers if it finds a link to these URLs somewhere.
This can cause problems on big web servers, therefore you can turn off this behavior with <code><AllowWholeHost>false</AllowWholeHost></code>. In this case JoBo will only collect documents below the start URL, e.g. if your start URL is www.matuschek.net/photo, it is allowed to collect www.matuschek.net/photo/index.html and www.matuschek.net/photo/pics.html but not www.matuschek.net.
Note that if you start with /something, JoBo will also download /other, because it can't figure out whether /something is a file or a directory. Therefore it assumes that everything without a trailing slash is a file and everything with a slash at the end is a directory.
<AllowWholeDomain>
By default, JoBo does not travel to hosts other than the start host. But some companies use several different web server names. If you want to download all the material from the Apache group, you will need to grab www.apache.org, java.apache.org, jakarta.apache.org, xml.apache.org and others. For this purpose you can set <code>AllowWholeDomain</code> to true. If you start at www.apache.org, JoBo is then allowed to walk to all web servers in the same domain (apache.org in this example).
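As a rough illustration of the Apache example, the relevant part of jobo.xml might look like the fragment below. It is only a sketch: it assumes the setting sits inside the <Robot> element, as in the sample file above, and all other settings are left out for brevity. How the start URL (www.apache.org here) is handed to JoBo is not shown.

```xml
<Robot>
  <!-- let JoBo follow links to other hosts in the start host's domain,
       e.g. from www.apache.org to jakarta.apache.org or xml.apache.org -->
  <AllowWholeDomain>true</AllowWholeDomain>
</Robot>
```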
<FlexibleHostCheck>
Some web servers use an inconsistent addressing scheme and use the host name for the web server with and without the prefix "www." (e.g. www.sourceforge.net and sourceforge.net).
If <code>FlexibleHostCheck</code> is set to true, JoBo treats the hostname with and without the leading "www." as the same host and is allowed to download from both.
<Proxy>
Set this if you need to use an HTTP proxy. The syntax of the proxy setting is proxyname:port.
<Bandwidth>
Using this tag you can limit the bandwidth used by JoBo. This could be interesting if you don't want JoBo to use all of your available internet bandwidth.
The value is given in bytes per second. Setting this to 0 disables the bandwidth limit.
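For example, to keep JoBo at roughly 10 KB per second, the setting could look like this. The value 10240 is only an illustrative number, and the placement inside <Robot> follows the sample file above:

```xml
<Robot>
  <!-- limit JoBo to 10240 bytes per second (about 10 KB/s);
       a value of 0 would disable the limit -->
  <Bandwidth>10240</Bandwidth>
</Robot>
```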
<MaxDocumentAge>
<EnableCookies>
If true, JoBo will accept cookies from web servers, otherwise it will not use cookies.
<AllowedUrl>
You can have many <AllowedUrl> statements in your jobo.xml.
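A minimal sketch of several statements side by side is shown below. The URLs are examples only, the placement inside <Robot> follows the sample file above, and how exactly JoBo matches a URL against these entries is not described here.

```xml
<Robot>
  <!-- several AllowedUrl statements; the URLs are illustrative -->
  <AllowedUrl>http://www.matuschek.net</AllowedUrl>
  <AllowedUrl>http://www.apache.org</AllowedUrl>
  <AllowedUrl>http://jakarta.apache.org</AllowedUrl>
</Robot>
```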
<VisitMany>
<LocalizeLinks>
Why this XML configuration ?
Form handlers
Let's start with a simple example of a FormHandler:
<FormHandler url="http://www.matuschek.net/test.cgi">
<FormField name="user" value="daniel"/>
<FormField name="pass" value="test"/>
</FormHandler>
Here is another form handler example that starts a search on Google for the term "jobo":
<FormHandler url="http://www.google.com/search">
<FormField name="q" value="jobo"/>
</FormHandler>
For a form handler, you have to define the form URL and default values for the form fields.
Many form handlers can be defined in the <code class="filename">jobo.xml</code> file.
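Putting the two examples together, a jobo.xml with more than one form handler might contain something like the following. This is a sketch only: the enclosing <JoBo> root element is taken from the sample configuration above, but the exact position of the form handlers inside it is an assumption, and the rest of the file is omitted.

```xml
<JoBo>
  <!-- other settings omitted -->

  <!-- form with user and pass fields on the test site -->
  <FormHandler url="http://www.matuschek.net/test.cgi">
    <FormField name="user" value="daniel"/>
    <FormField name="pass" value="test"/>
  </FormHandler>

  <!-- Google search for the term "jobo" -->
  <FormHandler url="http://www.google.com/search">
    <FormField name="q" value="jobo"/>
  </FormHandler>
</JoBo>
```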
Download rules
<DownloadRule allow="false" mimeType="application/zip" />
<DownloadRule allow="false" mimeType="*/*" minSize="100000" />
A DownloadRule has the attributes allow, mimeType, minSize, maxSize. minSize and maxSize are optional arguments.
<DownloadRule allow="false" mimeType="*/*" maxSize="10000" />
<DownloadRule allow="false" mimeType="*/*" minSize="100000" />
You have to use the following rule:
<DownloadRule allow="false" mimeType="*/*" minSize="10000" maxSize="100000" />
<!-- ignore thumbnails -->
<DownloadRule allow="false" mimeType="image/*" maxSize="10000" />
<DownloadRule allow="true" mimeType="text/html" />
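Rules like these are placed inside the <DownloadRuleSet> element shown in the sample configuration. The fragment below is a sketch that combines rules from this section. Ending with a catch-all rule for */* mirrors the sample file; that more specific rules should come before it is an assumption based on that sample, not something stated here.

```xml
<DownloadRuleSet>
  <!-- skip ZIP archives -->
  <DownloadRule allow="false" mimeType="application/zip" />
  <!-- ignore thumbnails -->
  <DownloadRule allow="false" mimeType="image/*" maxSize="10000" />
  <!-- allow everything else -->
  <DownloadRule allow="true" mimeType="*/*" />
</DownloadRuleSet>
```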
Regular Expression URL checking
Note that regular expression wildcards differ from the "usual" file name wildcards. Therefore you use "\.gif$" instead of "*.gif". For more information about regular expressions, have a look at the web (e.g. AskJeeves).
Examples:
<!-- ignore GIF and JPEG files -->
<RegExpRule allow="false" pattern="\.gif$" />
<RegExpRule allow="false" pattern="\.jpg$" />
<RegExpRule allow="false" pattern="\.jpeg$" />
<!-- ignore CGI (this isn't a 100% solution because sometimes -->
<!-- it is not possible to find out what is a CGI) -->
<RegExpRule allow="false" pattern="cgi-bin" />
<RegExpRule allow="false" pattern="\.cgi$" />
<RegExpRule allow="false" pattern="\?" />
<!-- ignore ASP -->
<RegExpRule allow="false" pattern="\.asp$" />
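In jobo.xml these rules go inside the <URLCheck> element shown in the sample configuration. The fragment below is a sketch of a complete block built from the patterns above. The final rule that allows everything is copied from the sample file; that the deny rules should be listed before it is an assumption based on that sample.

```xml
<URLCheck>
  <!-- ignore GIF files -->
  <RegExpRule allow="false" pattern="\.gif$" />
  <!-- ignore URLs that contain a query string -->
  <RegExpRule allow="false" pattern="\?" />
  <!-- allow every other URL ("." matches any character) -->
  <RegExpRule allow="true" pattern="." />
</URLCheck>
```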

