pdf generation - PDF Parsing -- Extract single page

I wrote a program in Python that allowed me to read in a PDF, take commands from the user, and output part or all of the original PDF's pages in different orders. You could select the pages you were interested in. At the time, there was a great library for it, PyPDF2. It did most of the heavy lifting.

Now I am working in a language (Haskell) that has no PDF support that I can find. I'm considering making my own personal library. However, when looking at the contents of a PDF file, I'm finding it hard to determine where specific pages are. I can tell how many pages there are in total in the file, but I can't look at a specific part of the file and say, "this is page x of y." So, how do I separate out content based on pages? How can I split a file based on pages if I don't know which page the content is on?

The first thing you need is a copy of the PDF specification. You can download it for free from the Adobe web site here: http://wwwimages.adobe.com/content/dam/adobe/en/devnet/pdf/pdfs/pdf32000_2008.pdf

In that document, look at section 7.7.3, which explains how the "page tree" works.

Basically, a PDF file contains a tree (Adobe suggests it should be a balanced tree, but you're under no obligation to keep it that way) starting with a "Pages" object, optionally containing a number of intermediate-level "Pages" objects, and ending in "Page" objects. For example:

Pages
. Pages
  . Page (1)
  . Page (2)
  . Page (3)
. Pages
  . Pages
    . Page (4)
    . Page (5)
  . Pages
    . Page (6)
    . Page (7)

The number of levels in the tree is not limited. To find a given page, you have to walk the tree from start to finish, assigning page numbers as you find the leaf "Page" objects. I have indicated in the example above which page numbers these objects represent (starting with page index 1).
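As a rough illustration of that walk, here is a minimal sketch in Python (the same recursion carries over directly to Haskell). It assumes the page tree has already been parsed into nested dictionaries; the helper names `page`, `pages`, `iter_pages`, and `find_page` and the `Label` key are made up for this example and are not part of any library or of the PDF format. Only `Type` and `Kids` correspond to real keys in page tree nodes.

```python
# Toy in-memory model of a page tree. A real parser would build these
# dictionaries from the file's indirect objects; this layout is an assumption.
def page(label):
    return {"Type": "Page", "Label": label}


def pages(*kids):
    return {"Type": "Pages", "Kids": list(kids)}


def iter_pages(node):
    """Yield leaf Page objects in document order (depth-first, left to right)."""
    if node["Type"] == "Page":
        yield node
    else:
        for kid in node["Kids"]:
            yield from iter_pages(kid)


def find_page(root, number):
    """Return the leaf with the given 1-based page number, or None."""
    for index, leaf in enumerate(iter_pages(root), start=1):
        if index == number:
            return leaf
    return None


# Same shape as the example tree above: seven pages over three levels.
root = pages(
    pages(page("a"), page("b"), page("c")),
    pages(
        pages(page("d"), page("e")),
        pages(page("f"), page("g")),
    ),
)

fourth = find_page(root, 4)  # the leaf labelled "d"
```

Intermediate "Pages" nodes also carry a Count entry giving the number of leaf pages beneath them, so a less naive version could skip whole subtrees instead of visiting every leaf.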

Once you have the page object, you can use it (and potentially its parents) to find the resources you need for the page. Again, look in the PDF specification for the "Resources" dictionary, and mind the discussion of inheritance.
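The inheritance rule can be sketched as a lookup that climbs the Parent links until it finds the key. This is a minimal Python sketch under the assumption that each node is a dictionary whose `Parent` entry already points at the parent node's dictionary (in a real file it is an indirect reference that has to be resolved first). The function name `inherited` and the sample values are hypothetical; the set of inheritable keys is the one the specification lists for page attributes.

```python
# Page attributes that the specification allows a page to inherit
# from its ancestor Pages nodes.
INHERITABLE = {"Resources", "MediaBox", "CropBox", "Rotate"}


def inherited(node, key):
    """Return node[key], falling back to the nearest ancestor that defines it."""
    if key not in INHERITABLE:
        # Non-inheritable keys are only ever read from the node itself.
        return node.get(key)
    while node is not None:
        if key in node:
            return node[key]
        node = node.get("Parent")
    return None


# Hypothetical three-level chain: root Pages -> intermediate Pages -> Page.
root = {
    "Type": "Pages",
    "Resources": {"Font": {"F1": "Helvetica"}},
    "MediaBox": [0, 0, 612, 792],
}
middle = {"Type": "Pages", "Parent": root}
leaf = {"Type": "Page", "Parent": middle, "MediaBox": [0, 0, 595, 842]}

resources = inherited(leaf, "Resources")  # found on the root
media_box = inherited(leaf, "MediaBox")   # the leaf's own value wins
```

When copying a single page into a new file, values resolved this way need to be written onto the extracted page (or onto its new parent), since the original ancestors are no longer there to supply them.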
