Big Data Engineer Resume
Chicago, IL
SUMMARY
- 8+ years of overall software development experience in Big Data technologies, the Hadoop ecosystem, and SQL, with programming experience in Python, Scala, and Java.
- 3+ years of strong hands-on experience with the Hadoop ecosystem, including Spark, MapReduce, Hive, Pig, HDFS, YARN, HBase, Oozie, Kafka, and Sqoop.
- Experience in architecting, designing, and building distributed data pipelines.
- Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
- Worked with real-time data processing and streaming techniques using Spark Streaming and Kafka.
- Experience in moving data into and out of HDFS and relational database systems (RDBMS) using Apache Sqoop.
- Experience developing Kafka producers and consumers for streaming millions of events per second.
- Significant experience writing custom UDFs in Hive and custom InputFormats in MapReduce.
- Knowledge of job workflow scheduling and monitoring tools such as Oozie and Airflow.
- Experience using various Hadoop Distributions (Cloudera, Hortonworks, Amazon AWS EMR) to fully implement and utilize various Hadoop services.
- Experience working with NoSQL databases like MongoDB, Cassandra and HBase.
- Used Hive extensively for performing various data analytics required by business teams.
- Solid experience working with various data formats such as Parquet, ORC, Avro, and JSON.
- Good experience in designing and implementing end-to-end data security and governance within the Hadoop platform using Kerberos.
- Hands-on experience developing end-to-end Spark applications using Spark APIs such as RDDs, the DataFrame API, Spark MLlib, Spark Streaming, and Spark SQL.
- Good experience working with various data analytics and big data services in the AWS Cloud, such as EMR, Redshift, S3, Athena, and Glue.
- Good understanding of Spark ML algorithms such as classification, clustering, and regression.
- Experienced in migrating data warehousing workloads into Hadoop-based data lakes using MapReduce, Hive, Pig, and Sqoop.
- Set up build and deployment automation for Java-based projects using Jenkins.
- Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with differing schemas into Hive ORC tables (a representative sketch follows this section).
- Experience maintaining an Apache Tomcat, MySQL, LDAP, and web service environment.
- Designed ETL workflows in Tableau and loaded data from various sources into HDFS.
- Good experience with use-case development and software methodologies such as Agile and Waterfall.
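A minimal, illustrative PySpark sketch of the CSV-to-Hive-ORC loading pattern mentioned above; the paths, table name, and columns are hypothetical placeholders, not taken from any actual project.

    # Minimal sketch only: load CSV files into a Hive-backed ORC table with PySpark.
    # All paths, table names, and columns below are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("csv_to_hive_orc_sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Read CSV files with header and schema inference; files with differing schemas
    # are reconciled here by selecting a common set of columns before writing.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/data/landing/events/*.csv"))              # hypothetical landing path

    common_cols = ["event_id", "customer_id", "event_ts"]  # hypothetical columns
    (df.select(*common_cols)
       .write.mode("append")
       .format("orc")
       .saveAsTable("analytics.events"))                   # hypothetical Hive table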
TECHNICAL SKILLS
Big Data Technologies / Hadoop Components: HDFS, Hue, MapReduce, YARN, Sqoop, Pig, Hive, HBase, Oozie, Kafka, Impala, ZooKeeper, Flume, Cloudera Manager, Airflow
Spark: Spark SQL, Spark Streaming, DataFrames, YARN, pair RDDs
Cloud Services: AWS (S3, EC2, EMR, Lambda, RedShift, Glue), Azure (Azure Data Factory / ETL / ELT / SSIS, Azure Data Lake Storage, Azure Databricks)
Programming Languages: SQL, PySpark, Python, Scala, Java
Databases: Oracle, MySQL, DB2, SQL Server, Teradata
NoSQL Databases: HBase, Cassandra, MongoDB
Web Technologies: HTML, JDBC, JavaScript, CSS
Version Control Tools: GitHub, Bitbucket
Server-Side Scripting: UNIX Shell, PowerShell
IDE: Eclipse, PyCharm, Notepad++, IntelliJ, Visual Studio
Operating Systems: Linux, Unix, Ubuntu, Windows, CentOS
PROFESSIONAL EXPERIENCE
Big Data Engineer
Confidential - Chicago, IL
Responsibilities:
- Working as a Data Engineer using Big Data and Hadoop ecosystem components to build highly scalable data pipelines.
- Worked in an Agile development environment and participated in daily scrums and other design-related meetings.
- Involved in converting Hive/SQL queries into Spark transformations using PySpark.
- Worked on Spark SQL: created DataFrames by loading data from Hive tables, prepared the data, and stored it in AWS S3.
- Responsible for loading customer data and event logs from Kafka into Redshift through Spark Streaming (a representative sketch follows this section).
- Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations in Scala using the Spark framework.
- Optimized Hive queries using best practices and appropriate parameters, leveraging technologies such as Hadoop, YARN, Python, and PySpark.
- Worked on reading and writing multiple data formats such as JSON, ORC, and Parquet on HDFS using PySpark.
- Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance.
- Created Sqoop Scripts to import and export customer profile data from RDBMS to S3 buckets.
- Built custom input adapters to migrate clickstream data from FTP servers to S3.
- Developed various enrichment applications in Spark using Scala to cleanse and enrich clickstream data with customer profile lookups.
- Troubleshot Spark applications to improve fault tolerance and reliability.
- Used the Spark DataFrame API and Spark core APIs to implement batch processing jobs.
- Worked on fine-tuning and performance enhancement of various Spark applications and Hive scripts.
- Used Spark features such as broadcast variables, caching, and dynamic allocation to design more scalable Spark applications.
- Implemented continuous integration and deployment using CI/CD tools such as Jenkins, Git, and Maven.
Environment: AWS EMR, S3, Spark, Hive, Sqoop, Scala, Java, MySQL, Oracle DB, Athena, Redshift.
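A hedged sketch of the Kafka-to-Redshift streaming pattern referenced above, written as Spark Structured Streaming that lands events in S3 as Parquet, a common staging step before a Redshift COPY; the brokers, topic, schema, and bucket names are hypothetical.

    # Minimal sketch, not production code: consume events from Kafka with Spark Structured
    # Streaming and land them in S3 as Parquet for a downstream Redshift COPY.
    # Requires the spark-sql-kafka connector on the classpath; all names below are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka_to_s3_staging_sketch").getOrCreate()

    event_schema = StructType([
        StructField("customer_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical brokers
           .option("subscribe", "customer-events")              # hypothetical topic
           .load())

    # Kafka values arrive as bytes; cast to string and parse the JSON payload.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), event_schema).alias("e"))
              .select("e.*"))

    # Continuously append Parquet files to an S3 staging prefix.
    query = (events.writeStream
             .format("parquet")
             .option("path", "s3a://example-bucket/staging/customer_events/")          # hypothetical
             .option("checkpointLocation", "s3a://example-bucket/checkpoints/events/")  # hypothetical
             .start())
    query.awaitTermination()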
Spark Developer
Confidential - Minneapolis, MN
Responsibilities:
- Data ingestion into the data lake using an open-source Hadoop distribution to process structured, semi-structured, and unstructured datasets.
- Expertise in Hive queries: created user-defined aggregate functions, worked on advanced optimization techniques, and have extensive knowledge of joins.
- Created Hive scripts to extract, transform, load (ETL), and store data using Talend.
- Developed Sqoop Scripts to extract data from DB2 EDW source databases onto HDFS.
- Worked with Oracle and Teradata for data import/export operations from different data marts.
- Worked extensively on data migration, data cleansing, data profiling, and ETL processes for data warehouses.
- Hands-on experience in Azure Cloud: stored data in Azure Data Lake Storage (ADLS) and worked with App Services, Databricks clusters for running jobs, Azure SQL Database, Virtual Machines, the Fabric Controller, Azure AD, Azure Search, and Notification Hubs.
- Designed, configured, and deployed Microsoft Azure for a multitude of applications using the Azure stack (including Compute, Web & Mobile, Blobs, Resource Groups, Azure SQL, Cloud Services, and ARM), focusing on high availability, fault tolerance, and auto-scaling.
- Developed Spark applications using Python (PySpark).
- Developed and implemented API services using Python in Spark.
- Created partitions and buckets based on state for further processing using bucket-based Hive joins (a representative sketch follows this section).
- Responsible for continuous monitoring and management of the Elastic MapReduce (EMR) cluster through the AWS console.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Knowledge of handling Hive queries using Spark SQL integrated with the Spark environment.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
Environment: Hadoop, Hive, Talend, MapReduce, Pig, Salesforce, Sqoop, Splunk, CDH5, Python, HDFS, DB2, Oozie, PuTTY, Java.
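A hedged illustration of the partition-and-bucket pattern behind the bucket-based joins above; this sketch uses Spark-managed bucketing through the DataFrameWriter rather than native Hive DDL, and all table, path, and column names are hypothetical.

    # Minimal sketch of partitioning by state and bucketing by customer_id with PySpark.
    # Spark-managed bucketing is used here; names below are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("partition_bucket_sketch")
             .enableHiveSupport()
             .getOrCreate())

    orders = spark.read.orc("/data/curated/orders")        # hypothetical source data

    # Partition by state and bucket by customer_id so joins on customer_id can avoid a full shuffle.
    (orders.write
           .partitionBy("state")
           .bucketBy(32, "customer_id")
           .sortBy("customer_id")
           .format("orc")
           .mode("overwrite")
           .saveAsTable("analytics.orders_by_state"))      # hypothetical metastore table

    # Join on the bucketing key; with both sides bucketed the same way, the shuffle can be skipped.
    customers = spark.table("analytics.customers")          # assumed bucketed dimension table
    joined = (spark.table("analytics.orders_by_state")
              .join(customers, "customer_id")
              .where("state = 'MN'"))
    joined.show()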
Data Engineer
Confidential - Scottsdale, AZ
Responsibilities:
- Built new universes in Business Objects per user requirements by identifying the required tables from the data mart and defining the universe connections.
- Used Business Objects to create reports based on SQL queries. Generated executive dashboard reports with the latest company financial data by business unit and by product.
- Performed data analysis and mapping, database normalization, performance tuning, query optimization, data extraction, transfer, and loading (ETL), and cleanup.
- Implemented Teradata RDBMS analysis with Business Objects to develop reports, interactive drill charts, balanced scorecards, and dynamic Dashboards.
- Responsible for requirements gathering, status reporting, creating various metrics, and project deliverables.
- Developed PL/SQL procedures, functions, and packages, and used SQL*Loader to load data into the database.
- Designed and developed Informatica mappings to load data from source systems. Worked on the Informatica PowerCenter tools: Source Analyzer, Warehouse Designer, Mapping/Mapplet Designer, and Transformation Designer.
- Involved in migrating warehouse database from Oracle 9i to 10g database.
- Involved in analyzing and adding new Oracle 10g features such as DBMS_SCHEDULER, CREATE DIRECTORY, Data Pump, and CONNECT BY ROOT to the existing Oracle 9i application.
- Tuned report performance by exploiting Oracle's new built-in functions and rewriting SQL statements.
Environment: SQL Server, J2EE, UNIX, .NET, MS Project, Oracle, WebLogic, Shell script, JavaScript, HTML, Microsoft Office Suite 2010, Excel
Hadoop Developer
Confidential
Responsibilities:
- Involved in requirements analysis and the design of an object-oriented domain model.
- Designed use case diagrams, class diagrams, sequence diagrams, and object diagrams.
- Involved in designing user screens using HTML as per user requirements.
- Used Spring-Hibernate integration in the back end to fetch data from Oracle and MySQL databases.
- Used Spring Dependency Injection properties to provide loose-coupling between layers.
- Implemented the Web Service client for the login authentication, credit reports and applicant information.
- Used Web services (SOAP) for transmission of large blocks of XML data over HTTP.
- Implemented the logging mechanism using Log4j framework.
- Wrote test cases in JUnit for unit testing of classes.
- Developed application to be implemented on Windows XP.
- Created the application using the Eclipse IDE.
- Installed WebLogic Server for handling HTTP requests and responses.
- Used Subversion for version control and created automated build scripts.