Data Engineer/Big Data Developer Resume
Plano, Texas
PROFESSIONAL SUMMARY:
- Over 7 years of IT development experience, including Data Warehousing and Data Analysis, and more than 4 years of experience in the Big Data ecosystem using the Hadoop framework and related technologies such as HDFS, MapReduce, Hive, Pig, Spark, HBase, Flume, Oozie, Sqoop, Impala, Kafka, Splunk, and Zookeeper.
- Excellent knowledge of distributed storage (HDFS) and distributed processing (MapReduce, YARN) for real-time streaming and batch processing.
- Experience in developing Python and Spark DataFrame code to perform data transformations and build tables in Hive (a sketch follows this summary).
- Interpreted problems and identified areas of improvement using data analysis, data mining, statistics and machine learning techniques.
- Wrote MapReduce programs in Java and Python and extended Hive functionality with custom UDFs.
- Experience in extracting data from RDBMS into HDFS using Sqoop.
- Worked on collecting real-time streaming data and log data from log collectors into HDFS using Flume and Kafka.
- Experience with NoSQL databases such as HBase for key-based low latency queries.
- Experience in analyzing data in HDFS through Impala, Hive and Spark.
- Experience in workflow scheduling and monitoring tools like Control-M and Oozie.
- Hands-on experience in performing ETL operations in the Ab Initio environment and creating dashboard reports using Tableau.
- Hands-on experience in Python, Java, Multi-threaded processing, SQL and PL/SQL.
- Hands-on experience in loading and processing unstructured data (log files, XML data) into HDFS using Python and Flume.
- Experience in performance tuning using partitioning, bucketing, and indexing in Hive.
- Hands-on experience testing Hadoop jobs with the MRUnit framework.
- Flexible with Unix/Linux environments, working with operating systems such as CentOS, Red Hat, and Ubuntu.
- In-depth knowledge of database connectivity for databases like Oracle 12c/11g/10g, MS SQL Server 2005/2008, MS Access, DB2 and Teradata.
- Experience in writing ad-hoc MapReduce jobs using Pig Latin.
- Experience in performing Spark RDD transformations to process data using Data Frames and Data Pipelines.
- Experience in working with lambda architecture using Scala on Spark.
- Worked extensively on developing PySpark scripts to generate output files in formats such as Avro, JSON, and XML.
- Experience in working with Hive tables in developing data pipelines, implementing complex business logic and optimizing Hive queries.
- Worked on importing data into HBase using HBase shell and HBase client API.
- Experience in automating job flows using Oozie.
- Experience in working with Apache SOLR and Elasticsearch.
- Worked on loading and unloading data to and from files in an Amazon S3 bucket.
- Experience in developing shell scripts in UNIX and using SQL or PL/SQL to process data from the input file and load it into the database.
- Involved in developing generic and custom Ab Initio graphs for the Unix environment.
- Experience in building a data lake to consolidate existing historical data.
- Worked on pulling relevant raw data from the data lake for analysis according to the requirements.
- Developed Unix shell wrapper scripts to run graphs in development, testing, and production environments.
- Experience in planning and scheduling Control-M jobs and automating workloads to reduce incident volume.
- Excellent understanding of statistics, machine learning, data mining, predictive analysis, data warehousing concepts, data modeling procedures, data profiling, data structures and algorithms.
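Illustrative PySpark sketch of the DataFrame transformation and Hive loading work summarized above; a minimal example, assuming a Hive-enabled Spark session, with hypothetical file paths, table, and column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hive-enabled Spark session (assumes access to a Hive metastore).
spark = (SparkSession.builder
         .appName("ingest-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read a delimited flat file into a DataFrame (path is hypothetical).
raw = spark.read.csv("/data/landing/orders.csv", header=True, inferSchema=True)

# Basic transformations: cast the timestamp and derive a partition column.
orders = (raw
          .withColumn("order_ts", F.to_timestamp("order_ts"))
          .withColumn("order_dt", F.to_date("order_ts")))

# Write into a partitioned Hive table; partitioning supports faster querying.
(orders.write
       .mode("overwrite")
       .partitionBy("order_dt")
       .format("parquet")
       .saveAsTable("sales.orders"))
```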
TECHNICAL SKILLS:
Big Data Ecosystem: HDFS, MapReduce, Spark, YARN, Pig, Hive, Impala, HBase, Sqoop, Flume, Cloudera Hue, SOLR, Storm, Kafka, Elasticsearch, Oozie, Zookeeper
Languages: Java, Scala, Python, Pig Latin, HiveQL, SQL, PL/SQL, XML, JSON
Database Systems: Oracle 12c/11g/10g, MS SQL Server 2005/2008, MS Access, DB2, Teradata, Greenplum
NoSQL Databases: HBase, Cassandra, MongoDB
IDEs: Eclipse, IntelliJ, NetBeans
Scripting Tools: UNIX Shell Scripting, Perl
Operating Systems: Linux, Unix, Windows 7/Vista/XP/10
ETL Tools: Ab Initio (GDE 3.2.2, Co>Operating System 3.0.4.10), Tableau
Scheduling Tools: Control-M, Tidal Enterprise Scheduler, Crontab
PROFESSIONAL EXPERIENCE:
Confidential, Plano, Texas
Data Engineer/Big Data Developer
Responsibilities:
- Built a generic data ingestion framework to extract data from multiple sources like Oracle, delimited flat files, XML, Parquet, and JSON, using it to build Hive/Impala tables.
- Built big data analytic solutions to provide near real-time and batch data reports to business users according to requirements.
- Responsible for the design, development, and maintenance of Oozie workflows integrating shell, Java, Sqoop, Hive, and Spark action nodes to run data pipelines.
- Used Python to parse XML files and create flat files from them (a sketch follows this list).
- Worked with Spark DataFrames, Datasets, and RDDs using Python to transform data and load it into Hive tables based on the requirements.
- Worked extensively with PySpark/Spark SQL for data cleansing and for generating DataFrames and RDDs.
- Analyzed SQL scripts for the design and implementation of solutions using PySpark.
- Worked on creating Spark applications in Scala using map, flatMap, reduceByKey, filter, groupByKey, distinct, cogroup, join, count, collect, and reduce functions to process data.
- Used Impala for low latency queries, visualization and faster-querying purposes.
- Imported, exported, and appended incremental data into HDFS using PySpark and Sqoop from an Oracle database and ingested it into Hive tables.
- Used PySpark to build tables that require multiple computations and non-equi joins.
- Exported analyzed data to relational databases using PySpark and Sqoop for visualization and to generate reports for the BI team.
- Used HBase to support front end applications that retrieve data using row keys.
- Built a data quality framework using Java and Impala to run data rules, generate reports, and send daily email notifications of successful and failed business-critical jobs to business users.
- Determined the size of data and the level of computation required to process it, and leveraged suitable methodologies to transform the data and compute aggregations.
- Handled the design and support of multi-tenancy on our data platform to allow other teams to run their applications.
- Worked on configuration and automation of workflows using Control-M and led the production support teams through their operational, scheduling and monitoring activities.
- Created partitioned tables in Hive for better performance and faster querying.
- Worked extensively on creating Hive tables to store data that resulted after querying from large data sets.
- Worked on debugging and performance tuning of Hive and Pig jobs.
- Knowledge of SQL, Numpy, Pandas, Scikit-learn, and PySpark for data analysis and model building.
- Worked with file formats such as Avro and text and compression codecs such as Snappy, Gzip, and Bzip2.
- Processed JSON files using PySpark and created Hive tables.
- Developed Python regular expression (re module) operations in the Hadoop/Hive environment.
- Developed on-the-fly decryption support for Hive, Pig, and custom MapReduce use cases using the Java API.
- Involved in using HCatalog to access Hive metadata from MapReduce or Pig code.
- Worked on data pre-processing and cleaning to perform feature engineering and data imputation for missing values in various datasets using Python.
- Automated jobs for pulling or sending files from and to SFTP servers according to business requirements.
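Illustrative Python sketch of the XML-to-flat-file parsing mentioned above; a minimal example that assumes one record per <customer> element, with hypothetical tag names and paths:

```python
import csv
import xml.etree.ElementTree as ET

def xml_to_flat_file(xml_path, out_path):
    """Parse a simple XML feed and write it out as a pipe-delimited flat file."""
    tree = ET.parse(xml_path)
    root = tree.getroot()
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="|")
        writer.writerow(["customer_id", "name", "city"])  # header row
        for rec in root.iter("customer"):                 # hypothetical record tag
            writer.writerow([
                rec.findtext("id", default=""),
                rec.findtext("name", default=""),
                rec.findtext("city", default=""),
            ])

if __name__ == "__main__":
    xml_to_flat_file("customers.xml", "customers.psv")
```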
Confidential, Scottsdale, Arizona
Data Engineer
Responsibilities:
- Developed multiple MapReduce programs for extraction, transformation, and aggregation of data from multiple file formats including XML, JSON, CSV, and compressed formats.
- Wrote MapReduce jobs to parse and process data as per requirements.
- Designed and implemented MapReduce jobs to support distributed processing and provide end-to-end solutions in the Cloudera environment.
- Implemented automation of workloads using Shell Scripting.
- Involved in debugging Java MapReduce programs that process data.
- Implemented different processing logic for both ORC and text file formats.
- Built the data lake environment from scratch by pooling data from different sources.
- Worked extensively on importing and exporting data into HDFS and Hive/Impala tables from Relational Database Systems using Sqoop.
- Used Flume to collect and store data from different sources into HDFS and later processed it using HiveQL, Pig Latin and Java.
- Built scalable multi-threaded applications for large data processing using Pig.
- Developed Pig Latin scripts to extract data from files and load it into HDFS.
- Worked on HBase whenever required for low latency queries.
- Used Oozie workflow engine to run multiple Hive/Impala and Pig jobs.
- Assigned schemas and created Hive tables with partitioning and bucketing for faster analytics and better performance.
- Implemented Spark RDD transformations and actions to migrate MapReduce algorithms to Spark (a sketch follows this list).
- Developed User Defined Functions (UDFs) to provide custom Hive and Pig capabilities.
- Used lambda architecture to build large-scale, distributed data processing systems.
- Created solutions built on NoSQL databases and compared them with SQL Server-based solutions.
- Involved in backup activities, preparation of transition documents, and ETL (Ab Initio) design and testing.
- Involved in migrating ETL processes from Ab Initio to Hadoop environments.
- Involved in loading and unloading data to and from HDFS with Ab Initio, using Hadoop read/write components and direct Hive scripts.
- Performed data loads and unloads from DB2 and Oracle databases to Hive using Sqoop scripts.
- Guided testing teams in test data preparation to ensure test coverage, including production data scrubbing and test data manipulation using Hive and Pig scripts.
- Responsible for building and maintaining a regression test tool using Ab Initio, Hive, and Python scripts.
- Worked with support teams to resolve operational and performance issues.
- Designed distributed solutions for parallel processing of large data.
- Worked with cross-functional teams to produce efficient data results and reports as per the business requirements.
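Illustrative PySpark sketch of migrating a MapReduce-style map/reduce aggregation to Spark RDD transformations, as referenced above; the input path and record layout are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-migration-sketch").getOrCreate()
sc = spark.sparkContext

# Each input line is assumed to look like: "<date>,<store_id>,<sale_amount>"
lines = sc.textFile("/data/sales/daily.txt")

daily_totals = (lines
                .map(lambda line: line.split(","))
                .filter(lambda parts: len(parts) == 3)           # drop malformed rows
                .map(lambda parts: (parts[0], float(parts[2])))  # "map" phase
                .reduceByKey(lambda a, b: a + b))                # "reduce" phase

for day, total in daily_totals.collect():
    print(day, total)
```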
Confidential, Columbus, Ohio
ETL Developer
Responsibilities:
- Developed several Ab Initio graphs based on business requirements using components such as Reformat, Rollup, Join, Scan, Normalize, and Gather.
- Maintained Unix shell scripts for moving Ab Initio code from one Unix environment to another.
- Designed, developed, tested, and implemented Ab Initio graphs and Korn shell scripts.
- Developed Ab Initio graphs for data validation using validation components such as Compare Records and Compute Checksum.
- Implemented data parallelism in graphs using Ab Initio partition components to segment data and operate on each segment simultaneously.
- Responsible for documentation of complete Graphs and their components.
- Involved in Ab Initio design and configuration for ETL, data mapping, transformation, and loading in a complex, high-volume environment, processing data at the terabyte level.
- Involved in automating the ETL process through scheduling.
- Responsible for cleansing data from source systems using Ab Initio components such as Reformat and Filter by Expression.
- Developed psets (parameter sets) to impose reusable business restrictions and improve graph performance.
- Developed several partition-based Ab Initio Graphs for high volume data warehouse.
- Experience in batch scheduling using Control-M Workload Automation.
- Experience in Control-M version migration and maintaining server integrity to reduce downtime in the production environment.
- Experience in data analysis to reduce incident frequency using various incident and change management tools such as Hive, Automic Workload Automation, Control-M, Resolve, and HP Service Manager.
- Coordinated with various L3 support and SME teams to track and document incident data.
Confidential
Business Analyst
Responsibilities:
- Performed SWOT analysis of transaction processing in the current-state system and workflows, and estimated savings from proposed workflow changes and system enhancements.
- Identified return on investment (ROI) opportunities to reduce costs and increase transaction processing efficiency through automation.
- Modeled the system using Use Cases and State diagrams, and performed gap, risk, and cost-benefit analysis.
- Delivered Business Requirement Documentation (BRD) from Use Case specifications, identified goals and objectives, and supported Functional Specification Documentation (FSD) and Data Mapping Documentation.
- Researched and modeled activity workflows to report business trends for business unit development.
- Wrote SQL statements to extract data from application back-end and verify it with front-end results.
- Involved in developing and handling PL/SQL packages, procedures and database triggers.
- Identified data sources, coordinated data feeds from external sources, and developed reports, queries, and forms to meet the information needs of business partners.
- Involved in writing business logic and in developing and modifying a business portal that allows end users to apply for financial licenses online.
- Generated customer data reports using PL/SQL procedures and handled generation of user profiles.
- Involved in the development of a Tallyman collection product that allows businesses to manage the recovery of debt from customers.
- Provided technical support for clients, business partners and end-users.
- Involved in redefining subscriber profiles using PL/SQL procedures and triggers based on the number of cheque bounces, barring activities, etc.
- Prepared technical specification, interface, and detailed design documents to generate customer status reports using PL/SQL stored procedures in Oracle 11g.