
Big Data Engineer Resume


Philadelphia, PA

SUMMARY

  • 8 years of experience in analysis, design, development, integration, testing and maintenance of various applications using Python, including 5 years of Big Data/Hadoop/PySpark experience.
  • Experienced in building highly scalable Big Data solutions using Hadoop distributions (Cloudera, Hortonworks) and NoSQL platforms.
  • Expertise in big data architecture with the Hadoop file system and its ecosystem tools: MapReduce, HBase, Hive, Pig, ZooKeeper, Oozie, Kafka, Flume, Avro, Impala and Apache Spark.
  • Hands-on experience performing data quality checks on petabytes of data.
  • Good knowledge of Amazon AWS services like EMR and EC2, which provide fast and efficient processing of Big Data.
  • Developed, deployed and supported several MapReduce applications in Python to handle semi-structured and unstructured data.
  • Experience in writing MapReduce programs and using the Apache Hadoop API for analyzing data.
  • Expertise in developing Hive scripts for data analysis.
  • Hands-on experience in the data mining process, implementing complex business logic, optimizing queries using HiveQL, and controlling data distribution with partitioning and bucketing techniques to enhance performance (see the sketch at the end of this list).
  • Experience working with Hive data, extending the Hive library with custom UDFs to query data in non-standard formats.
  • Involved in the ingestion of data from various databases like DB2 and SQL Server using Sqoop.
  • Experience working with Kafka to handle large volumes of streaming data.
  • Extensive experience in migrating ETL operations into HDFS using Pig scripts.
  • Good knowledge of big data analytics libraries (MLlib) and use of Spark SQL for data exploration.
  • Experience in implementing a distributed messaging queue to integrate with Cassandra using Apache Kafka and Zookeeper.
  • Expert in designing and creating data ingest pipelines using technologies such as Spring Integration and Apache Storm/Kafka.
  • Experience with the Oozie workflow engine in running workflow jobs with actions that run Hadoop MapReduce and Pig jobs.
  • Worked with different file formats like TextFile, Avro and ORC for Hive querying and processing.
  • Used compression techniques (Snappy) with file formats to make efficient use of storage in HDFS.
  • Experienced with build tools Maven and Ant, and continuous integration tools like Jenkins.
  • Hands-on experience in using relational databases like Oracle, MySQL, PostgreSQL and MS-SQL Server.
  • Experience using IDE tools Eclipse 3.0, MyEclipse, RAD and NetBeans.
  • Hands-on development experience with RDBMS, including writing SQL queries, PL/SQL, views, stored procedures, triggers, etc.
  • Highly motivated, dynamic, self-starter with keen interest in emerging technologies.
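
The partitioning and bucketing pattern referenced above can be illustrated with a minimal PySpark sketch; the table, column names and paths below are hypothetical, not taken from any actual project.

    # Illustrative sketch: create a partitioned, bucketed Hive table with PySpark
    # (all names and paths are placeholders).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partitioning-sketch")
             .enableHiveSupport()
             .getOrCreate())

    orders = spark.read.parquet("/data/raw/orders")  # hypothetical source path

    # Partition by a low-cardinality date column and bucket by a join key so Hive
    # can prune partitions on filters and co-locate rows for joins.
    (orders.write
           .mode("overwrite")
           .partitionBy("order_date")
           .bucketBy(32, "customer_id")
           .sortBy("customer_id")
           .saveAsTable("analytics.orders_bucketed"))

    # Queries filtering on the partition column read only the matching partitions.
    spark.sql("""
        SELECT customer_id, COUNT(*) AS order_cnt
        FROM analytics.orders_bucketed
        WHERE order_date = '2020-01-01'
        GROUP BY customer_id
    """).show()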

TECHNICAL SKILLS

Big Data Technologies: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Avro, Hadoop Streaming, ZooKeeper, Kafka, Impala, Apache Spark, Hue, Ambari, Apache Ignite

Hadoop Distributions: Cloudera (CDH4/CDH5), Hortonworks

Languages: Python, PL/SQL, HQL

IDE Tools: PyCharm, VS Code, Databricks Notebooks

Frameworks: Flask, PySpark, PyHive, web2py

Operating Systems: Windows (XP,7,8), UNIX, LINUX, Ubuntu, CentOS

Application Servers: JBoss, Tomcat, WebLogic, WebSphere, Servlets

Reporting Tools/ETL Tools: Tableau, Power View for Microsoft Excel, Informatica

Databases: Oracle, MySQL, DB2, Derby, PostgreSQL; NoSQL databases (HBase, Cassandra)

PROFESSIONAL EXPERIENCE

Confidential, Philadelphia, PA

Big Data Engineer

Responsibilities:

  • As a member of the RECON team, handled reconciliation of data across different sources.
  • Created dashboards in Grafana for monitoring metrics across different jobs.
  • Support hands-on architecture of ETL pipelines using an internal framework written in Apache Spark and Python.
  • Implement DQ (data quality) metrics and controls for data in a big data environment (see the sketch at the end of this list).
  • Interpret data and analyze results using statistical techniques, and provide ongoing reports.
  • Develop and implement databases, data collection systems, data analytics and other strategies that optimize statistical efficiency and quality.
  • Develop and implement ETL pipelines that pick up data from S3 via intermediate servers and push it to ThoughtSpot servers for generating visualization reports.
  • Implement ETL to load data from Oracle to S3.
  • Develop and implement ETL loads from HDFS to S3 using Databricks and Spark with Python.
  • Create, run and schedule ETL pipelines using the Rundeck tool with a process-control mechanism.
  • Hands-on architecture and development of ETL pipelines using an internal framework written in Apache Spark, Java and Python for process control; hands-on development consuming Kafka/REST APIs and other streaming sources using Spark and persisting data in graph or NoSQL databases.
  • Implement DQ metrics and controls for data in a big data environment; interpret data and analyze results using statistical techniques and provide ongoing reports.
  • Develop and implement databases, data collection systems, data analytics and other strategies that optimize statistical efficiency and quality; acquire data from primary or secondary data sources and maintain databases/data systems; identify, analyze and interpret trends or patterns in complex data sets; filter and clean data by reviewing reports and performance indicators to locate and correct problems; work with management to prioritize business and information needs; locate and define new process-improvement opportunities.
  • Provide architectural and best-practice ideas and suggestions to improve the current setup.
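
A minimal sketch of the kind of DQ control described above, assuming PySpark and hypothetical S3 paths, column names and thresholds:

    # Hypothetical data-quality checks: source-vs-target reconciliation counts and
    # null-rate thresholds on critical columns, raised as hard failures.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("dq-checks-sketch").getOrCreate()

    source = spark.read.parquet("s3a://example-bucket/source/")  # placeholder paths
    target = spark.read.parquet("s3a://example-bucket/target/")

    # Control 1: record counts must reconcile between source and target.
    src_count, tgt_count = source.count(), target.count()
    assert src_count == tgt_count, "Count mismatch: %d vs %d" % (src_count, tgt_count)

    # Control 2: critical columns must stay under a 1% null rate.
    critical_cols = ["account_id", "txn_amount"]  # placeholder columns
    null_rates = target.select([
        (F.sum(F.col(c).isNull().cast("int")) / F.count(F.lit(1))).alias(c)
        for c in critical_cols
    ]).first().asDict()

    failures = {c: r for c, r in null_rates.items() if r > 0.01}
    if failures:
        raise ValueError("DQ failure - null rate above threshold: %s" % failures)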

Environment: Oozie, Hive, HDFS, Spark 2.3.0, Scala, IntelliJ, AWS, S3, MySQL, Sqoop, Bitbucket, Maven, Ambari, Snowflake, SnowSQL, Snowpipe

Confidential, NC

Big Data Engineer - python

Responsibilities:

  • Worked on implementation of pipelines using Python, SnowSQL and Snowpipe, and created Snowpipes for continuous data loads.
  • Well versed with Python packages such as NumPy and pandas, and with building APIs using Flask.
  • Worked on data cleaning and data engineering tasks such as feature selection and feature extraction, and built data flow pipelines with Airflow.
  • Created Lambda functions with Python and triggered them with CloudWatch Events on AWS.
  • Used Databricks with Python to build notebooks and integrate them with existing pipelines.
  • Well versed with Snowflake features like clustering, time travel, cloning, logical data warehouses, caching, etc.
  • Involved in data migration to Snowflake using AWS S3 buckets.
  • Converted SQL Server mapping logic to SnowSQL queries.
  • Wrote SnowSQL queries to interact with the compute layer and retrieve data from the storage layer.
  • Worked on data analysis in Snowflake and data migration from various sources like flat files, HANA, Teradata, Oracle, MongoDB, Hive and Kafka.
  • Working on ETL development in Snowflake.
  • Experience building CI/CD pipelines and DevOps integration using Oozie, Git and Screwdriver.
  • Knowledge of AWS solutions using EC2, EMR, S3, RDS, EBS, Elastic Load Balancing, Auto Scaling groups, VPC and CloudFormation.
  • Built a secure EMR launcher with custom spark-submit steps using S3 events, SNS, KMS and Lambda functions.
  • Extracted, transformed and loaded data sources to generate CSV data files with Python and SQL queries.
  • Used AWS Data Pipeline for data extraction, transformation and loading from homogeneous and heterogeneous data sources, and built various graphs for business decision-making using the Python Matplotlib library.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
  • Implemented Spring Boot microservices to process messages into the Kafka cluster.
  • Used Spring Kafka API calls to process messages on the Kafka cluster.
  • Wrote Kafka producers to stream data from external REST APIs to Kafka topics (see the sketch at the end of this list).
  • Worked on data pre-processing and cleaning to perform feature engineering, and performed data imputation techniques for missing values in the dataset using Python.
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
  • Extracted files from Cassandra and MongoDB through Sqoop, placed them in HDFS and processed them.
  • Good understanding of Zookeeper for monitoring and managing Hadoop jobs.
  • Experience with RDBMS and writing SQL and PL/SQL scripts used in stored procedures.
  • Developed the batch scripts to fetch the data from AWS S3 storage and do required transformations.
  • Developed Spark code using Scala and Spark-SQL for faster processing of data.
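
A minimal sketch of the REST-to-Kafka producer pattern mentioned above, assuming the kafka-python client; the endpoint, topic and broker address are placeholders:

    # Hypothetical sketch: poll an external REST API and publish each record to a
    # Kafka topic as JSON.
    import json
    import time

    import requests
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["localhost:9092"],
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    while True:
        response = requests.get("https://api.example.com/events", timeout=10)
        response.raise_for_status()
        for event in response.json():
            # Key-less sends are distributed round-robin across partitions.
            producer.send("external-events", value=event)
        producer.flush()
        time.sleep(30)  # simple polling interval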

Environment: Oozie, Hive, HDFS, Spark 2.3.0, Scala, IntelliJ, AWS, S3, MySQL, Sqoop, Bitbucket, Maven, Ambari, Snowflake, SnowSQL, Snowpipe.

Confidential, Mooresville, NC

Big Data Engineer

Responsibilities:

  • Processed data with Python, PySpark and Spark SQL and loaded it into Hive partitioned tables in Parquet file format.
  • Created multiple Hive tables and implemented partitioning, dynamic partitioning and bucketing in Hive for efficient data access.
  • Implemented SCDs (slowly changing dimensions) with Python and Hive.
  • Worked with Kafka to load data from Oracle to Hive to S3 using Python.
  • Imported and exported data into HDFS and Hive using Sqoop.
  • Built and maintained SQL scripts, indexes and complex queries for data analysis and extraction.
  • Experience in using Scala to convert Hive/SQL queries into RDD transformations in Spark.
  • Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs.
  • Worked in querying data using Spark SQL on top of Spark engine.
  • Joined various tables in Cassandra using Spark and Python and ran analytics on top of them.
  • Used Spark SQL to load JSON data, created SchemaRDDs and loaded them into Hive tables, handling structured data with Spark SQL (see the sketch at the end of this list).
  • Developed, deployed and troubleshot ETL workflows using Hive and Sqoop.
  • Used Google Cloud Functions with Python to load data into BigQuery for CSV files on arrival in a GCS bucket.
  • Involved in scheduling the Oozie workflow engine to run multiple Hive, Pig and Spark jobs.
  • Created data pipeline of gathering, cleaning and optimizing data using Hive, Spark.
  • Worked with different Oozie actions to design workflows, including Sqoop, Pig, Hive and shell actions.
  • Worked with NoSQL databases like HBase, creating tables to load large sets of semi-structured data coming from source systems.
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce for loading data into HDFS and extracted the data from MySQL into HDFS using Sqoop.
  • Involved in scheduling the Oozie workflow engine to run multiple Hive jobs.
  • Used Spark SQL to process the huge amounts of structured data available in Hive tables.
  • Extracted data from RDBMS through Sqoop, placed it in HDFS and processed it using Hive.
  • Loaded data into Hive Tables from Hadoop Distributed File System (HDFS) to provide SQL-like access on Hadoop data.
  • Utilized Oozie workflows to run Pig and Hive jobs; extracted files from MongoDB through Sqoop, placed them in HDFS and processed them.
  • Migrated HiveQL queries to Spark SQL to improve performance.
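
A minimal sketch of loading JSON with Spark SQL into a partitioned, Parquet-backed Hive table, as referenced above; paths, table and column names are hypothetical:

    # Hypothetical sketch: read JSON, derive a partition column, and persist the
    # result as a partitioned Parquet Hive table queryable with plain SQL.
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("json-to-hive-sketch")
             .enableHiveSupport()
             .getOrCreate())

    events = spark.read.json("hdfs:///data/raw/events/")  # placeholder path
    events = events.withColumn("event_date", F.to_date("event_ts"))  # placeholder column

    (events.write
           .mode("append")
           .format("parquet")
           .partitionBy("event_date")
           .saveAsTable("warehouse.events"))

    spark.sql("SELECT event_date, COUNT(*) FROM warehouse.events GROUP BY event_date").show()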

Environment: Oozie, Hive, HDFS, Spark 2.3.0, Scala, IntelliJ, BigQuery, GCS Bucket, MySQL, Sqoop, Bitbucket, Maven, Ambari.

Confidential, Phoenix, AZ

Hadoop Developer

Responsibilities:

  • Enabled Spark Streaming with Python, collecting data from Kafka in near real time and performing the necessary transformations and aggregations to build the common learner data model, then storing the data in a NoSQL store (HBase); see the sketch at the end of this list.
  • Installed and configured various components of the Hadoop ecosystem such as Flume, Hive, Pig, Sqoop, Oozie, ZooKeeper, Kafka and Storm, and maintained their integrity.
  • Good knowledge of Confidential internal data sources such as Cornerstone, WSDW, IDN and SQL.
  • Used Apache Kafka Connect for streaming data between Apache Kafka and other systems.
  • Experience performing advanced procedures like text analytics using the in-memory computing capabilities of Spark with Python.
  • Partitioned data streams using Kafka; designed and configured the Kafka cluster to accommodate heavy throughput.
  • Responsible for fetching real-time data using Kafka and processing it using Spark Streaming with Python.
  • Implemented Hive Partitioning and Bucketing on the collected data in HDFS.
  • Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
  • Implemented Sqoop jobs to import/export large data exchanges between RDBMS and Hive platforms.
  • Used Kafka Connect, a utility for streaming data between MapR Event Store for Apache Kafka and other storage systems.
  • Worked on visualization tool Tableau for visually analyzing the data.
  • Developed Python scripts using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark for data aggregation and queries, and for writing data back into the OLTP system through Sqoop.
  • Migrated MapReduce programs into Spark transformations using Python.
  • Handled application deployment and scheduling in the cloud with Spark/Ambari.
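
A minimal Structured Streaming sketch of the Kafka consumption and transformation step described above; the brokers, topic and event schema are hypothetical, the job assumes the Spark Kafka connector package is on the classpath, and a console sink stands in for the HBase store to keep the example self-contained:

    # Hypothetical sketch: consume JSON events from Kafka with Spark Structured
    # Streaming, parse them, and aggregate in near real time.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka-streaming-sketch").getOrCreate()

    schema = StructType([                      # placeholder event schema
        StructField("learner_id", StringType()),
        StructField("score", DoubleType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "learner-events")
           .load())

    parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
                 .select(F.from_json("json", schema).alias("e"))
                 .select("e.*"))

    agg = parsed.groupBy("learner_id").agg(F.avg("score").alias("avg_score"))

    # Writing to HBase would require a separate connector; the console sink keeps
    # this sketch runnable on its own.
    query = agg.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()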

Environment: MapReduce, Python, Spring Framework 2.1.3, Oracle 11.2.0.3, Kafka connectors, Spark, HiveQL, Node.js v8.11.1, NoSQL, Java 1.8, Tableau, Ambari user views, Spark real-time data sources, cloud platform, consumers.

Confidential, MA

Hadoop Developer

Responsibilities:

  • Designed the schema and data model and wrote algorithms to store all validated data in Cassandra using Spring Data Cassandra REST.
  • Standardized the input merchant data, uploaded images, indexed the given data sets into Search and persisted the data in HBase tables.
  • Set up the Spark Streaming and Kafka cluster and developed a Spark Streaming Kafka application.
  • Developed prototype Spark applications using Spark Core, Spark SQL and the DataFrame API.
  • Involved in data analysis using Python and handled ad-hoc requests as required.
  • Developed Python scripts for automating tasks.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Imported and exported data into HDFS and Hive using Sqoop.
  • Supported setting up the QA environment and updating configurations for implementing scripts with Pig and Sqoop.
  • Used AWS services like EC2 and S3 for small data sets.
  • Hands-on experience in AWS Cloud with services such as Redshift clusters and Route 53 domain configuration.
  • Involved in file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS (see the sketch at the end of this list).
  • Generated stock alerts, price alerts, popular product alerts and new arrivals for each user based on likes, favorites and share counts.
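
A minimal sketch of the S3 file movement mentioned above, assuming boto3; the bucket, keys and local paths are placeholders:

    # Hypothetical sketch: stage a local extract (e.g., pulled from HDFS with
    # `hdfs dfs -get`) into S3, then pull it back down for a QA environment.
    import boto3

    s3 = boto3.client("s3")
    bucket = "example-analytics-bucket"  # placeholder bucket name

    s3.upload_file("/tmp/exports/merchants.csv", bucket, "exports/merchants.csv")
    s3.download_file(bucket, "exports/merchants.csv", "/tmp/qa/merchants.csv")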

Environment: Cassandra, Hive, Spark (Core, SQL, MLlib, Streaming), Hadoop, MapReduce, Python, AWS, ZooKeeper, shell scripting

Confidential, St. Cloud, MN

Hadoop Developer

Responsibilities:

  • Developed a simulator to emit events based on the NYC DOT data file.
  • Built a Kafka producer to accept and send events to the Kafka spout of the Storm topology.
  • Created, altered and deleted Kafka topics as required, with varying configurations.
  • Set up and managed Kafka for stream processing, including broker and topic configuration and creation.
  • Loaded and transformed large sets of unstructured data from the UNIX system to HDFS.
  • Developed use cases and technical prototypes for implementing Pig, HDP, Hive and HBase.
  • Experienced in running Hadoop Streaming jobs to process terabytes of CSV-format data (see the mapper sketch at the end of this list).
  • Supported MapReduce programs running on the cluster.
  • Developed data pipelines using Flume, Sqoop, Pig and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
  • Wrote a Storm topology to accept and process events from the Kafka producer.
  • Developed Storm bolts to emit data into HBase, HDFS and RabbitMQ Web Stomp.
  • Wrote Hive queries to map truck event, weather and traffic data.
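
A minimal sketch of a Hadoop Streaming mapper for CSV records, as referenced above; the column position and field meaning are hypothetical, and a companion reducer would sum the emitted counts:

    #!/usr/bin/env python
    # Hypothetical Hadoop Streaming mapper: read CSV lines from stdin and emit
    # "route_id<TAB>1" pairs for a downstream count reducer.
    import csv
    import sys

    for row in csv.reader(sys.stdin):
        if len(row) < 3:
            continue          # skip malformed lines
        route_id = row[2]     # placeholder column position
        print("%s\t%d" % (route_id, 1))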

Environment: Hadoop, HDFS, Hive, HBase, Kafka, Storm, RabbitMQ Web Stomp, Google Maps, New York City truck routes from NYC DOT.

Confidential

Java/ Hadoop Developer

Responsibilities:

  • Involved in sprint planning as part of monthly deliveries.
  • Involved in daily scrum calls and stand-up meetings as part of the agile methodology.
  • Hands-on experience with the VersionOne tool to update work details and working hours for tasks.
  • Involved in designing views.
  • Involved in writing Spring configuration files and business logic based on requirements.
  • Involved in code-review sessions.
  • Implemented JUnit tests for the business logic with respect to the assigned backlog in the sprint plan.
  • Implemented fixtures to execute FitNesse test tables.
  • Created Jenkins CI jobs and Sonar jobs.

Environment: Core Java, Spring, Maven, XMF Services, JMS, Oracle 10g, PostgreSQL 9.2, FitNesse, Eclipse, SVN.
