Commit
PDFParser final version
oscii committed May 17, 2015
1 parent d27e703 commit 1ec830f
Showing 8 changed files with 1,036 additions and 42 deletions.
12 changes: 7 additions & 5 deletions ceur-ws-pdfs/CeurWsPDFParser/parsers/PdfExtractionLib.py
@@ -27,7 +27,7 @@ def main():
     else:
         a = 1
 
-def get_html_and_txt(input_filename, add_files = False, update_files = True):
+def get_html_and_txt(input_filename, add_files = True, update_files = True):
     try:
         out_inf = {
             "html": u"",
@@ -56,18 +56,20 @@ def get_html_and_txt(input_filename, add_files = False, update_files = True):
         html_command =u"{0} -o \"{1}\" \"{2}\"".format(path_to_pdf2txt, temp_html_file, input_filename)
 
         if not os.path.exists(temp_txt_file):
-            print(txt_command)
+            #print(txt_command)
             #os.system(txt_command)
+            a = 1
         else:
             if update_files:
-                print(txt_command)
+                #print(txt_command)
                 #os.system(txt_command)
+                a = 1
         if not os.path.exists(temp_html_file):
-            print(html_command)
+            #print(html_command)
             os.system(html_command)
         else:
             if update_files:
-                print(html_command)
+                #print(html_command)
                 os.system(html_command)
 
         # fh = codecs.open(temp_txt_file, 'rb')
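Review note: the hunk above builds a shell string and passes it to `os.system`, so correct quoting depends on the file names involved. The same run-or-skip logic could be sketched with `subprocess` and an argument list, as below. The helper name `run_if_needed` is hypothetical and not part of this commit, and the sketch assumes `path_to_pdf2txt` is a single executable path rather than a multi-word command.

```python
import os
import subprocess


def run_if_needed(args, output_path, update_files=True):
    """Run the command in `args` unless its output exists and updates are off.

    Mirrors the branches in get_html_and_txt: the command runs when
    `output_path` is missing, or when `update_files` is true.
    Returns True if the command was run, False if it was skipped.
    (Hypothetical helper, not part of the commit.)
    """
    if os.path.exists(output_path) and not update_files:
        return False
    # An argument list bypasses the shell, so spaces or quotes in
    # file names need no manual escaping.
    subprocess.check_call(args)
    return True
```

Under those assumptions, the HTML branch would reduce to `run_if_needed([path_to_pdf2txt, "-o", temp_html_file, input_filename], temp_html_file, update_files)`. Unlike `os.system`, `check_call` raises `CalledProcessError` when the command exits with a nonzero status.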
257 changes: 257 additions & 0 deletions ceur-ws-pdfs/CeurWsPDFParser/parsers/dictionaries/countries.txt
@@ -0,0 +1,257 @@
Afghanistan
Albania
Algeria
American Samoa
Andorra
Angola
Anguilla
Antarctica
Antigua And Barbuda
Argentina
Armenia
Aruba
Australia
Austria
Azerbaijan
Bahamas
Bahrain
Bangladesh
Barbados
Belarus
Belgium
Belize
Benin
Bermuda
Bhutan
Bolivia
Bosnia And Herzegovina
Botswana
Bouvet Island
Brazil
British Indian Ocean Territory
Brunei
Brunei Darussalam
Bulgaria
Burkina Faso
Burundi
Cambodia
Cameroon
Canada
Cape Verde
Cayman Islands
Central African Republic
Chad
Chile
China
Christmas Island
Cocos (keeling) Islands
Colombia
Comoros
Congo
Congo
The Democratic Republic Of The Cook Islands
Costa Rica
Cote D'Ivoire
Croatia
Cuba
Cyprus
Czech Republic
Denmark
Djibouti
Dominica
Dominican Republic
East Timor
Ecuador
Egypt
El Salvador
Equatorial Guinea
Eritrea
Estonia
Ethiopia
Falkland Islands
Malvinas
Faroe Islands
Fiji
Finland
France
French Guiana
French Polynesia
French Southern Territories
Gabon
Gambia
Georgia
Germany
Ghana
Gibraltar
Greece
Greenland
Grenada
Guadeloupe
Guam
Guatemala
Guinea
Guinea-Bissau
Guyana
Haiti
Heard Island And McD-onald Islands
Holy See
Honduras
Hong Kong
Hungary
Iceland
India
Indonesia
Iran
Islamic Republic Of Iran
Iraq
Ireland
Israel
Italy
Jamaica
Japan
Jordan
Kazakhstan
Kenya
Kiribati
Korea
Democratic People's Republic Of Korea
Korea
Republic Of Korea
Kosovo
Kuwait
Kyrgyzstan
Lao People's Democratic Republic
Latvia
Lebanon
Lesotho
Liberia
Libyan Arab Jamahiriya
Libya
Liechtenstein
Lithuania
Luxembourg
Macau
Macedonia
The Former Yugoslav Republic Of Macedonia
Madagascar
Malawi
Malaysia
Maldives
Mali
Malta
Marshall Islands
Martinique
Mauritania
Mauritius
Mayotte
Mexico
Micronesia
Federated States Of Micronesia
Moldova
Republic Of Moldova
Monaco
Mongolia
Montserrat
Montenegro
Morocco
Mozambique
Myanmar
Namibia
Nauru
Nepal
Netherlands
Netherlands Antilles
New Caledonia
New Zealand
Nicaragua
Niger
Nigeria
Niue
Norfolk Island
Northern Mariana Islands
Norway
Oman
Pakistan
Palau
Palestinian Territory, Occupied
Panama
Papua New Guinea
Paraguay
Peru
Philippines
Pitcairn
Poland
Portugal
Puerto Rico
Qatar
Reunion
Romania
Russian Federation
Russia
Rwanda
Saint Helena
Saint Kitts And Nevis
Saint Lucia
Saint Pierre And Miquelon
Saint Vincent And The Grenadines
Samoa
San Marino
Sao Tome And Principe
Saudi Arabia
Senegal
Serbia
Seychelles
Sierra Leone
Singapore
Slovakia
Slovenia
Solomon Islands
Somalia
South Africa
The Republic of South Africa
South Georgia And The South Sandwich Islands
Spain
Sri Lanka
Sudan
Suriname
Svalbard And Jan Mayen
Swaziland
Sweden
Switzerland
Syrian Arab Republic
Taiwan
Province Of China
Tajikistan
Tanzania
United Republic Of Tanzania
Thailand
Togo
Tokelau
Tonga
Trinidad And Tobago
Tunisia
Turkey
Turkmenistan
Turks And Caicos Islands
Tuvalu
Uganda
Ukraine
United Arab Emirates
United Kingdom
UK
United States
United States of America
US
United States Minor Outlying Islands
Uruguay
Uzbekistan
Vanuatu
Venezuela
Vietnam
Virgin Islands, British
Virgin Islands, U.S.
Wallis And Futuna
Western Sahara
Yemen
Zambia
Zimbabwe
44 changes: 44 additions & 0 deletions ceur-ws-pdfs/CeurWsPDFParser/parsers/dictionaries/existonto.txt
@@ -0,0 +1,44 @@
FOAF
TrackBack
MetaVocab
Basic Geo Vocabulary
BIO
RSS 1.0
VCard RDF
Creative Commons metadata
WOT
SIOC
Semantically-Interlinked Online Communities
DwC
Darwin Core
SSN
Semantic Sensor Network
GoodRelations
DOAP
Programmes Ontology
Music Ontology
Provenance Vocabulary
Pedagogical diagnosis
DILIGENT Argumentation Ontology
OpenGUID
The Multi-Source Ontology
MSO
Open Biomedical Ontologies
Ontology Design Patterns
Semantic Web for Earth and Environmental Terminology
LKIF
Core Ontology of Basic Legal Concepts
OpenGALEN
OpenGALEN Medical Ontology
Stanford Library of Ontologies
SUMO
WordNet
DOLCE
Quantum Mechanics Ontology
Engineering Mathematics Ontologies
SchemaWeb
Ontology for Long Term Ecological and Socioecological research
Finance ontology Ontology on securities handling
OntoCAPE
SNOMED
SNOMED CT5
16 changes: 16 additions & 0 deletions ceur-ws-pdfs/CeurWsPDFParser/parsers/dictionaries/stopwords.txt
@@ -0,0 +1,16 @@
OWL
OWL DL
RDF
RDF Schema
RDF
OWL 2
OWL 2
OWL DL
OWL 2
CycL
JPL
NCI
SPARQL
PL KB
PL
KB