  1. #1
    Join Date
    Mar 2007
    Posts
    6
    Plugin Contributions
    0

    Scraping product info from a PDF catalog for mass upload

    Hi Zenners,

    This is driving me slightly nuts: I am preparing a spreadsheet for a mass upload of products. The information sources are a price list in an Excel spreadsheet (super easy to modify for a mass upload) and a product catalog in a PDF.

    The only information I need to extract from the PDF is the product description. The descriptions are generally a paragraph of text and a feature list.

    I am having serious difficulty scraping the product info into the spreadsheet in a way that doesn't lose its formatting and look terrible once it is in. I'm not seeking to preserve font styling or colors, just the spacing and bullets.

    I'm finding that, for the projects I have completed so far, most suppliers seem to use PDF as a means of digitally distributing their product catalogs. This is a terrible digital format for interchange of information. I'm hoping there are other Zenners who have had a similar experience and may be able to shed some light on how to manage this.

    If I have to go into every product in Zen Cart to correct the formatting of the description text, it kinda defeats the object of having a mass upload in the first place.

    Any ideas welcomed!

    Regards,

    Ked

  2. #2
    Join Date
    Jul 2006
    Posts
    213
    Plugin Contributions
    0

    Re: Scraping product info from a PDF catalog for mass upload

    Ked,
    Several thoughts come to mind:
    1. There are programs, some free, which will convert PDF files to editable text. I don't know how well they preserve the original formatting such as bulleted lists.
    2. You can try a copy and paste into an HTML editor such as Nvu or Dreamweaver, clean up the formatting there, and paste the result into the spreadsheet. It will be faster than trying to edit the HTML in the product listing. I find it is far better than the HTMLarea editor supplied with Zen Cart.
    3. The things that drive me crazy are product specs like "ingredients", which use dotted lines for spacing, such as "Sugar . . . . 5 gms". I can never decide whether a list or a table format is better.
    4. Same problems if you're scraping info from the manufacturer's website, although at least you have HTML, not unformatted text.
    5. Be sure you don't run into any copyright problems. I figure the manufacturer supplied the information, so he shouldn't object to my using it to describe his product that I'm selling, but you never know. Some are very anal about it.
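
For the dotted-line specs in point 3, one possible approach is to split each line at the run of dots and then emit either a list or a table, so both can be compared quickly. This is only a rough sketch: the function names `split_leader` and `specs_to_html` are made up for illustration, and it assumes a "leader" is three or more dots, with or without spaces between them.

```python
import html
import re

# Assumption: a leader is a run of 3+ dots, optionally separated by spaces.
LEADER = re.compile(r"^(.*?)\s*(?:\.\s*){3,}(.*)$")


def split_leader(line):
    """Return (label, value) for a line like 'Sugar . . . . 5 gms', else None."""
    match = LEADER.match(line.strip())
    if not match or not match.group(1) or not match.group(2):
        return None
    return match.group(1), match.group(2).strip()


def specs_to_html(text, as_table=False):
    """Turn dotted-leader lines into a <ul> (default) or a two-column <table>."""
    pairs = [p for p in map(split_leader, text.splitlines()) if p]
    if as_table:
        rows = "".join(
            f"<tr><td>{html.escape(k)}</td><td>{html.escape(v)}</td></tr>"
            for k, v in pairs
        )
        return f"<table>{rows}</table>"
    items = "".join(
        f"<li>{html.escape(k)}: {html.escape(v)}</li>" for k, v in pairs
    )
    return f"<ul>{items}</ul>"


if __name__ == "__main__":
    sample = "Sugar . . . . 5 gms\nSalt ..... 0.5 gms"
    print(specs_to_html(sample))
    print(specs_to_html(sample, as_table=True))
```

Lines that do not contain a leader are skipped rather than guessed at, so they would still need checking by hand.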

    HTH


  3. #3
    Join Date
    Mar 2007
    Posts
    6
    Plugin Contributions
    0

    Re: Scraping product info from a PDF catalog for mass upload

    Hi Maury,

    Thanks for your reply, I really appreciate the feedback.

    I've found that Dreamweaver and Word especially overcomplicate the HTML they generate; I end up with yards of spans, divs and CSS where a simple <p> or <ul> would do the job.

    I found a really handy HTML converter today, just before I read your post in fact. It's called "Easy Text to HTML Converter":

    http://www.easyhtools.com/ethdescription.html

    This has reduced populating the spreadsheet to a three-click process, which is great. I can select the data in the PDF, paste it into the app, convert it, then drop the tagged text / lists straight into the spreadsheet cell. The great thing is, it doesn't add any headers, spans, divs, CSS or anything beyond very basic <p>, <ul> and <li> tags. It seems to detect lists and apply <ul>s very cleanly. It looks like the lifesaver I was searching for :)

    I agree with the dilemma between list and table format too; I've experimented with both over the past week and in this case I'm finding the lists look a lot clearer.
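
For anyone who would rather script this step, the same kind of conversion can be sketched in a few lines of Python. To be clear, this is not how Easy Text to HTML Converter works internally (I have no idea); it is a minimal sketch that assumes bullets in the pasted text start with a character like •, ·, *, - or – followed by a space, and that blank lines separate paragraphs. The name `text_to_html` is just an illustration.

```python
import html
import re

# Assumption: list items start with one of these markers plus whitespace.
BULLET = re.compile(r"^\s*[•·*\-–]\s+(.*)$")


def text_to_html(text):
    """Convert pasted plain text to basic <p>, <ul> and <li> markup."""
    out = []    # finished HTML fragments
    para = []   # lines of the paragraph currently being collected
    items = []  # items of the list currently being collected

    def flush_para():
        if para:
            out.append("<p>" + " ".join(para) + "</p>")
            para.clear()

    def flush_list():
        if items:
            out.append("<ul>" + "".join(f"<li>{i}</li>" for i in items) + "</ul>")
            items.clear()

    for line in text.splitlines():
        stripped = line.strip()
        match = BULLET.match(line)
        if match:
            # A bullet ends any open paragraph and extends the current list.
            flush_para()
            items.append(html.escape(match.group(1).strip()))
        elif not stripped:
            # A blank line ends whatever block is open.
            flush_para()
            flush_list()
        else:
            # Ordinary text ends any open list and joins the paragraph.
            flush_list()
            para.append(html.escape(stripped))
    flush_para()
    flush_list()
    return "".join(out)


if __name__ == "__main__":
    pasted = "Sturdy steel frame.\nShips flat.\n\n• Rust proof\n• 2 year warranty"
    print(text_to_html(pasted))
```

The output is deliberately a single line with no line breaks, so it can sit in one spreadsheet cell without upsetting the import file.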

    I really am amazed that a lot of suppliers / manufacturers make it so difficult to catalog their products by using inappropriate formats for their information.

    Thanks again for the response.

    regards,

    Ked

  4. #4
    Join Date
    Jul 2006
    Posts
    213
    Plugin Contributions
    0

    Re: Scraping product info from a PDF catalog for mass upload

    Ked,
    Thanks for the lead to Easy Text to HTML Converter. It looks like a very useful tool.

    Maury

 

 
