Agrifeeds – A closer look at the Drupal modules (part 1)

Two Drupal modules were written specifically for Agrifeeds. To give a better understanding of what happens behind the scenes, this overview briefly describes them. Both modules were designed for generic use and can work on any other Drupal website.

 

Feeds Fulltext

Feeds Fulltext can be enabled in three different ways. The first two are found under Configuration - Feeds Fulltext, and let a user with admin permissions enable the module either for all feeds or per content type attached to the feed source. The third, available when editing a feed source node, enables or disables Feeds Fulltext for an individual feed and allows other settings to be fine-tuned.

Once the module is enabled, the most important setting, and the only required one, is the list of class names the module should look for when searching for the text in the items’ original articles. The class names must be separated by commas, unless the element containing the text actually has more than one class name. In that case, they should be entered without a comma, in the exact order in which they appear in the original article’s HTML. Four commonly found class names are given by default: “content, news, post, article”, but it is advisable to check the article’s HTML to find an exact match.
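As a rough illustration of how such a setting could be read, the Python sketch below splits the value on commas and keeps space-separated class names together as one pattern. It is not the module's own code (the module is a Drupal module), and the function name parse_class_setting is hypothetical.

```python
def parse_class_setting(setting: str) -> list[str]:
    """Split the comma-separated setting into individual class patterns.

    Hypothetical helper: a pattern with several class names (no comma
    between them) is kept as one string, in the order it was typed.
    """
    patterns = []
    for part in setting.split(","):
        # Collapse runs of whitespace so "post   body" becomes "post body".
        pattern = " ".join(part.split())
        if pattern:
            patterns.append(pattern)
    return patterns


default_patterns = parse_class_setting("content, news, post, article")
multi_class_patterns = parse_class_setting("entry-content post, article")
```

Here the default value yields four single-class patterns, while "entry-content post" stays together as one two-class pattern.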

Two further options are available. The first publishes the item once the text has been retrieved. This is only needed if the content type attached to the item is set not to be published on save. The second resets all items associated with the feed, so that Feeds Fulltext can process them again. This can be useful if the class names given were not correct. The same can be achieved for an entire content type by disabling the module for the content type, and then re-enabling it with the setting “Also past items” (default is “Only future items”).

The module runs during cron. It first adds to the queue the 20 newest items that have not yet been processed. It then takes up to 30 items from the queue, one at a time, and loads each original article’s HTML using the item’s link. These numbers are not customizable at the moment, but might be in the future.
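The cron behaviour described above can be sketched in Python as follows. This is only a model under stated assumptions, not the module's implementation: the item fields ("created", "processed", "queued"), the fetch callback and the function name cron_run are all hypothetical, and it is assumed that items already waiting in the queue are not queued a second time.

```python
from collections import deque

ENQUEUE_LIMIT = 20  # newest unprocessed items added to the queue per run
PROCESS_LIMIT = 30  # maximum number of queued items handled per run


def cron_run(items, queue, fetch):
    """Simulate one cron run; returns how many items were handled."""
    # Assumption: an item that is already queued is not queued again.
    pending = [i for i in items if not i["processed"] and not i["queued"]]
    pending.sort(key=lambda i: i["created"], reverse=True)  # newest first
    for item in pending[:ENQUEUE_LIMIT]:
        item["queued"] = True
        queue.append(item)

    handled = 0
    while queue and handled < PROCESS_LIMIT:
        item = queue.popleft()
        item["html"] = fetch(item["link"])  # load the original article
        item["processed"] = True
        item["queued"] = False
        handled += 1
    return handled


# Fifty hypothetical items; a higher "created" value means a newer item.
items = [
    {"id": n, "created": n, "link": f"http://example.org/{n}",
     "processed": False, "queued": False}
    for n in range(50)
]
queue = deque()
handled = cron_run(items, queue, lambda url: "<html></html>")
```

With an empty queue, one run handles only the 20 items it has just added. The higher processing limit matters when items are left in the queue from an earlier run.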

When searching for the text to replace the item’s description field, Feeds Fulltext will look for elements that contain the given class names. They don’t need to be exact matches. The class name “content” will be found both in <div class="content"> and in <div class="page-content">. They are, however, case sensitive, so “Content” would not be found.
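The matching rule can be pictured as a plain, case-sensitive substring test on the class attribute. The Python sketch below uses the standard html.parser module to show the idea. The names ClassFinder and matching_classes are hypothetical, and the module's actual matching code may differ.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Collect the class attribute of every element that contains a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.found = []

    def handle_starttag(self, tag, attrs):
        class_attr = dict(attrs).get("class") or ""
        # Case-sensitive substring test: "content" is in "page-content".
        if self.pattern in class_attr:
            self.found.append(class_attr)


def matching_classes(html, pattern):
    finder = ClassFinder(pattern)
    finder.feed(html)
    finder.close()
    return finder.found


sample = (
    '<div class="content"></div>'
    '<div class="page-content"></div>'
    '<div class="Content"></div>'
)
lowercase_hits = matching_classes(sample, "content")
capitalised_hits = matching_classes(sample, "Content")
```

In this sample, "content" matches the first two elements but not the capitalised one, and "Content" matches only the third.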

If a class name is not found, the module moves on to the next one in the list, and if none are found, it simply keeps the item’s original text. If, instead, several elements with a particular class name are found, it takes the one containing the most text.
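The fallback order can be summarised in a few lines of Python. The sketch assumes the candidate elements have already been extracted as (class attribute, text) pairs. The function name choose_text and this data shape are illustrative assumptions, not the module's API.

```python
def choose_text(candidates, patterns, original_text):
    """Return the replacement text, or the original text if nothing matches.

    candidates: list of (class_attribute, text) pairs found in the article.
    patterns:   class names in the order given in the settings.
    """
    for pattern in patterns:
        texts = [text for class_attr, text in candidates if pattern in class_attr]
        if texts:
            # Several elements matched: keep the one with the most text.
            return max(texts, key=len)
    # No pattern matched any element: keep the item's original text.
    return original_text


candidates = [
    ("sidebar-content", "Short teaser."),
    ("page-content", "The full article body, which is considerably longer."),
    ("news", "A news box."),
]
first_choice = choose_text(candidates, ["content", "news"], "feed description")
second_choice = choose_text(candidates, ["Content", "news"], "feed description")
fallback = choose_text(candidates, ["missing"], "feed description")
```

The first call returns the longer of the two "content" matches, the second falls through to "news", and the third keeps the original description.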

All relative links in the text, which would not work from Agrifeeds or any other website using this module, are transformed into absolute links. The module also tries to avoid importing breadcrumb links from the original article, even if contained inside the element with the class name given in the settings. This might however fail at times.
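One way to picture the link rewriting is the Python sketch below, which resolves each href against the URL of the original article using urllib.parse.urljoin. The regular expression is a deliberate simplification that only handles quoted href attributes, and the function name absolutize_links is hypothetical. It does not attempt the breadcrumb handling mentioned above.

```python
import re
from urllib.parse import urljoin

# Simplified pattern: quoted href attributes only (an assumption of this sketch).
HREF_RE = re.compile(r'(?<![\w-])href\s*=\s*(["\'])(.*?)\1', re.IGNORECASE)


def absolutize_links(html, article_url):
    """Rewrite relative href values so they point at the original site."""

    def replace(match):
        quote, url = match.groups()
        # urljoin leaves URLs that are already absolute unchanged.
        return f"href={quote}{urljoin(article_url, url)}{quote}"

    return HREF_RE.sub(replace, html)


article_url = "http://example.org/news/item.html"
rooted = absolutize_links('<a href="/about">About</a>', article_url)
sibling = absolutize_links('<a href="other.html">Other</a>', article_url)
external = absolutize_links('<a href="http://other.org/x">X</a>', article_url)
```

Root-relative and document-relative links are both resolved against the article's address, while links that are already absolute pass through untouched.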

The text returned will be HTML, although only a selection of tags is allowed through the filter, which makes for a more pleasant visual experience. The text format of the items’ content type should therefore be set to “Full HTML”. Once the text is returned, the module publishes the node, if set to do so, and saves all changes. Finally, it clears the cache for the item’s node.
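A tag filter of this kind can be sketched in Python with the standard html.parser module. The list of allowed tags below is purely an assumption for illustration, since the tags the module actually lets through are not listed here. The names TagFilter and filter_html are hypothetical as well.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allow-list for illustration only; the module's real list may differ.
ALLOWED_TAGS = {"p", "a", "br", "strong", "em", "ul", "ol", "li"}


class TagFilter(HTMLParser):
    """Re-emit the document, dropping every tag that is not allowed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return
        href = dict(attrs).get("href") if tag == "a" else None
        if href is not None:
            # Keep only the href attribute on links; drop everything else.
            self.out.append(f'<a href="{escape(href)}">')
        else:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag != "br":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always kept, including text inside removed tags.
        self.out.append(escape(data, quote=False))


def filter_html(html):
    tag_filter = TagFilter()
    tag_filter.feed(html)
    tag_filter.close()
    return "".join(tag_filter.out)


filtered = filter_html(
    '<div class="x"><p>Hi <b>there</b> <a href="/a" onclick="x()">link</a></p></div>'
)
```

In this example the div and b tags and the onclick attribute are removed, while the paragraph, the link and all text are kept.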