search_attachments.module for Drupal | Mark’s web presence

search_attachments.module for Drupal

See the module’s project page at http://drupal.org/project/search_attachments for additional information, including archived support requests.

Purpose

search_attachments.module allows searching the text of PDF, MS Word, plain text, and other types of files attached to nodes. As of version 5.x-4, the module will also allow searching of files that are not attached to nodes but that are FTP’ed or otherwise uploaded to a Drupal site.

In order to extract the text from attached files, this module calls ‘helper apps’. The module is not limited to using specific helpers described below — Drupal administrators can configure any helpers they like.

Currently, search_attachments.module is available for Drupal 4.7.6 and 5.x. The 4.7.6 version is no longer supported, but the 5.x version will be maintained to keep pace with Drupal. All versions have been tested on Linux and Mac OS X, and users have reported no problems using the module on Windows other than with helper paths that contain spaces (see below).

If you need Drupal 6.x compatibility you may want to consider using Search Files, which offers a subset of search_attachments’ functionality. Search_attachments will be ported to Drupal 6 shortly.

Screen snapshots

Some screen snapshots are available that show both the end user and administrator views.

Helper apps

In order to use search_attachments.module, you will need the appropriate helper apps on the same computer that Drupal is running on. These apps need to print out extracted text to standard output; currently, search_attachments cannot read extracted text that is saved to a file.
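To illustrate the standard-output requirement, here is a minimal sketch of how a helper could be invoked from PHP. It is not the module's actual code: the function name example_extract_text() and its arguments are hypothetical, and it assumes a Unix-like system where the helper (here, cat) is on the path.

```php
<?php
/**
 * Hypothetical sketch: run a helper app and return whatever it prints to
 * standard output. Not part of search_attachments.module.
 */
function example_extract_text($helper_command, $file_path) {
  // Skip files the web server user cannot read.
  if (!is_readable($file_path)) {
    return '';
  }
  // Quote the file path so spaces and shell metacharacters are passed literally.
  $command = $helper_command . ' ' . escapeshellarg($file_path);
  // shell_exec() returns the command's standard output as a string, or a
  // non-string value on failure or when the command prints nothing.
  $output = shell_exec($command);
  return is_string($output) ? $output : '';
}

// Usage: plain text files can be "extracted" with cat.
$sample = tempnam(sys_get_temp_dir(), 'sa_');
file_put_contents($sample, "searchable words\n");
print example_extract_text('cat', $sample);
unlink($sample);
```

A helper that writes its result to a file would return nothing here, which is why such helpers cannot currently be used.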

Please see the Recommended Helper Apps list for links to helpers and more information.

Running search_attachments on Windows

Helpers with paths containing spaces, such as c:\Program Files\Acme Helper\ahelper.exe, will not work. You will need to install your helpers in locations that do not contain spaces. I have tried to troubleshoot this problem a number of times and have failed to come up with a reliable solution. Common sense approaches like imitating Windows’ own way of quoting paths, such as ‘"c:\Program Files\pfile.pl" "c:\tmp\data dir\test.txt"’, do not work. If anyone has a solid fix for this problem, and can supply some code, I would be happy to receive it.

Module driver files

As of 2007-06-08, the 5.1-dev version of search_attachments requires the use of file manager driver files, which contain several functions specific to each module that is being used to manage attachments. Here is upload_driver.inc, the driver file for the core upload.module:

<?php
// $Id$

/**
 * Each module that manages files needs to have a 'driver' like this one, containing the four functions below,
 * each with the module's name (a.k.a 'the current module' or 'the file management module' below) as the
 * second segment of the function name, e.g., get_upload_file_list(). More documentation is available at
 * http://interoperating.info/mark/search_attachments.
 */

/**
 * Returns a nested array of files which are managed by the current module, with their 'id' and
 * 'path' attributes as defined by the current module, and their last modified time via stat()'s
 * mtime value. We serialize $row->nid because nids are stored in {search_attachments_files} in
 * serialized form in order to accommodate arrays of nids for some file management modules, e.g., webfm.
 * Only files that have an extension defined in the {search_attachments_helpers} table get returned
 * by this function.
 */
function search_attachments_get_upload_register_files() {
  // Start with an empty list so that an array is returned even when no files match.
  $files = array();
  $result = db_query("SELECT fid, filepath FROM {files}");
  while ($row = db_fetch_object($result)) {
    if (search_attachments_has_helper($row->filepath) && (file_exists($row->filepath))) {
      // Clear PHP's stat cache so that mtime reflects the file's current state.
      clearstatcache();
      $stats = stat($row->filepath);
      $nids = search_attachments_get_upload_file_nids($row->fid);
      $files[] = array('module_id' => $row->fid, 'nid' => $nids, 'file_path' => $row->filepath,
        'module' => 'upload', 'changed' => $stats['mtime']);
    }
  }
  return $files;
}

/**
 * Given a file's ID in the current module's db table, returns an array of attributes ('mtime', 'url',
 * 'name', 'size', 'nid'). We serialize $row->nid because nids are stored in {search_attachments_files} in
 * serialized form in order to accommodate arrays of nids for some file management modules, e.g., webfm.
 */
function search_attachments_get_upload_file($fid) {
  // Select the attachment's size, file name, and path.
  $result = db_query("SELECT filesize AS size, filename AS name, filepath AS path FROM {files}
    WHERE fid = %d", $fid);
  $file = db_fetch_object($result);
  $info = array();
  clearstatcache();
  $stats = stat($file->path);
  $info['mtime'] = $stats['mtime'];
  $info['url'] = file_create_url($file->path);
  $info['name'] = $file->name;
  $info['size'] = $file->size;
  $info['nid'] = search_attachments_get_upload_file_nids($fid);
  return $info;
}

/**
 * Given a file's ID in the current module's db table, returns a serialized array of all the nodes
 * the file is attached to. For upload.module, there should always only be one nid in this list since
 * upload.module allows a file to be attached to only one node.
 */
function search_attachments_get_upload_file_nids($fid) {
  $nids = array();
  $result = db_query('SELECT nid FROM {files} WHERE fid = %d', $fid);
  while ($row = db_fetch_array($result)) {
    $nids[] = $row['nid'];
  }
  $serialized_nids = serialize($nids);
  return $serialized_nids;
}

/**
 * Returns a permission string that controls who can view files managed by the current module.
 * This string should be identical to the one used by the file management module.
 */
function search_attachments_get_upload_view_permission() {
  // Return permission string from module that allows users to access attachments.
  return 'view uploaded files';
}
?>
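The driver functions above hand back node IDs as a serialized array rather than a single value. As a rough illustration of that round trip, here is a small standalone sketch; the nid values are made up, and the variable names are only for this example.

```php
<?php
// Hypothetical nid list, as a driver might collect it for one file.
// Database layers typically return column values as strings.
$nids = array('12', '34');

// The string form a driver's *_file_nids() function would return.
$serialized_nids = serialize($nids);

// Turn the string back into an array before using the nids.
$restored = unserialize($serialized_nids);
foreach ($restored as $nid) {
  print "File is attached to node $nid\n";
}
```

A module like upload.module would produce a one-element list, while a module that lets a file belong to several nodes would produce a longer one; the calling code can treat both the same way.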

Search_attachments comes with three driver files: upload_driver.inc, attachment_driver.inc, and webfm_driver.inc. A fourth driver, no_file_manager_driver.inc, handles files that are not managed by another module, for example files that are FTP’ed to your Drupal instance. If you are using any of the three file management modules, or if you FTP files to your Drupal instance, you don’t have to do anything special; just follow the installation instructions below.

If you are using a different file management module, you will need to create a driver for it. To do so, put a PHP file in the search_attachments directory, named after the module you want to get file paths from with ‘_driver’ appended and an ‘.inc’ extension (like ‘upload_driver.inc’). Then write four functions in that file whose names follow the patterns ‘search_attachments_get_modulename_file_list($nid)’, ‘search_attachments_get_modulename_file($fid)’, ‘search_attachments_get_modulename_file_nids($fid)’, and ‘search_attachments_get_modulename_view_permission()’, as illustrated in the upload_driver.inc code above. Your functions will need to return the same variables that upload_driver.inc’s functions do.
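The naming pattern can be spelled out with a short sketch. The helper example_driver_names() and the module name ‘mymodule’ are hypothetical and exist only for illustration; this is not how search_attachments itself locates drivers. Note also that upload_driver.inc, as listed above, names its first function search_attachments_get_upload_register_files(), so check the drivers shipped with your version for the exact name expected.

```php
<?php
/**
 * Hypothetical helper: list the driver file name and the function names
 * that the naming pattern in the text implies for a given module name.
 */
function example_driver_names($module) {
  $prefix = 'search_attachments_get_' . $module . '_';
  return array(
    'file' => $module . '_driver.inc',
    'functions' => array(
      $prefix . 'file_list',
      $prefix . 'file',
      $prefix . 'file_nids',
      $prefix . 'view_permission',
    ),
  );
}

// Show the names a driver for a module called 'mymodule' would use.
print_r(example_driver_names('mymodule'));
```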

Installation and usage of search_attachments.module

  1. Install helper apps such as catdoc and pdftotext. The Unix cat command is sufficient to test .txt attachments.
  2. Place search_attachments.module in your Drupal modules directory.
  3. Log into Drupal as admin and go to administer > modules and activate the module (the 5.x version installs its own database tables; the 4.7.x version uses only the Drupal variable table).
  4. Go to administer > settings > Search attachments settings and configure the helpers you want to use. If you save the settings and the module can’t find the indicated helper apps, it will tell you.
  5. Attach some files to nodes using the file management modules that you have drivers for, or upload some files and make sure you have your directory paths configured properly.
  6. Go to Administer > Site configuration > Search settings, then re-index the site.
  7. Run cron.php on your site to index any new attachments (e.g., http://yoursite.org/cron.php).
  8. Test by searching for words contained in your attachments. You should see a ‘Files’ tab (or whatever you named it in the admin settings) listing your results.

This module uses PHP’s shell_exec() function, so you should restrict ‘administer’ access to trusted users — normal end users would never need to configure site-wide search settings anyway so this should be an obvious precaution. Just thought I’d point it out.

If you are sure that pdftotext and catdoc are installed (i.e., ‘which catdoc’ returns a valid path), and search_attachments.module still complains that it can’t find the helpers, chances are that PHP is configured to use safe mode (http://php.net/features.safe-mode).
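One way to check this from PHP itself is sketched below. The function name example_shell_exec_usable() is hypothetical and not part of the module. The sketch relies on two facts about PHP: shell_exec() is disabled while safe mode is on, and it can also be switched off through the disable_functions setting. On PHP 5.4 and later, where safe mode no longer exists, the safe_mode lookup simply comes back empty.

```php
<?php
/**
 * Hypothetical check: report whether shell_exec() appears to be usable
 * under the current PHP configuration.
 */
function example_shell_exec_usable() {
  // ini_get() returns FALSE for settings that do not exist in this PHP version.
  $safe_mode = strtolower((string) ini_get('safe_mode'));
  $safe_mode_on = in_array($safe_mode, array('1', 'on', 'yes', 'true'));

  // disable_functions is a comma-separated list of function names.
  $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));

  return !$safe_mode_on
    && function_exists('shell_exec')
    && !in_array('shell_exec', $disabled);
}

print example_shell_exec_usable()
  ? "shell_exec() looks usable\n"
  : "shell_exec() is restricted by the PHP configuration\n";
```

Run this through the web server rather than the command line if you can, since the two often use different php.ini files.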

To do

  • The ability to display search results from both nodes and attachments at the same time. This will likely not happen until Drupal 7, since there is an issue open for Drupal 7’s search module that will make combined search results lists a lot more feasible.
  • Integrate functionality to allow more efficient parsing of attachments, i.e., to reduce the possibility that cron.php will time out. If this is happening to you, see http://drupal.org/node/65307 for a way to run cron from the Unix command line.
  • The ability to use helper apps that save extracted text to a file instead of printing it to standard output.

Thank you

-Yuri McPhedran for early testing, suggestions, and pointers to helper apps for various platforms.
-Dmitry Arkhipkin for various patches, including one that enabled indexing of attachments separate from nodes.
-Andrew Turner for drupal_get_path patch.
-Jake Ochs for feedback of various types.
-WorldFallz for the attachment.module driver.
-Everyone else who posted comments with suggestions and offers of help, or who emailed me with same.
